From: Jens Axboe <axboe@kernel.dk>
To: Andres Freund <andres@anarazel.de>
Cc: io-uring@vger.kernel.org
Subject: Re: [PATCH 3/6] io_uring: support buffer selection
Date: Tue, 10 Mar 2020 07:37:44 -0600
Message-ID: <82a0f0f9-977a-c131-e77a-289f3db4d48d@kernel.dk>
In-Reply-To: <20200309172101.lgswlmmt763jtuno@alap3.anarazel.de>

On 3/9/20 11:21 AM, Andres Freund wrote:
> Hi,
> 
> On 2020-02-28 13:30:50 -0700, Jens Axboe wrote:
>> If a server process has tons of pending socket connections, generally
>> it uses epoll to wait for activity. When the socket is ready for reading
>> (or writing), the task can select a buffer and issue a recv/send on the
>> given fd.
>>
>> Now that we have fast (non-async thread) support, a task can have tons
>> of reads or writes pending. But those requests need buffers to back
>> that data, and if the number of connections is high enough, having
>> them preallocated for all possible connections is infeasible.
>>
>> With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
>> use for any request. The request then sets IOSQE_BUFFER_SELECT in the
>> sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
>> a free buffer from the specified group is selected. If none are
>> available, the request is terminated with -ENOBUFS. If successful, the
>> CQE on completion will contain the buffer ID chosen in the cqe->flags
>> member, encoded as:
>>
>> 	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
>>
>> Once a buffer has been consumed by a request, it is no longer available
>> and must be registered again with IORING_OP_PROVIDE_BUFFERS.
>>
>> Requests need to support this feature. For now, IORING_OP_READ and
>> IORING_OP_RECV support it. This is checked at SQE submission time; a CQE
>> with res == -EINVAL will be posted if attempted on unsupported requests.
> 
> Why not EOPNOTSUPP or such? Makes it more feasible for applications to
> handle the case separately.

Good point, I can make that change.
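
As a reference for the application side, consuming that CQE encoding
would look something like the below. Untested sketch, given a completed
struct io_uring_cqe *cqe in hand:

	if (cqe->res == -ENOBUFS) {
		/* group ran dry, provide more via IORING_OP_PROVIDE_BUFFERS */
	} else if (cqe->flags & IORING_CQE_F_BUFFER) {
		int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

		/* cqe->res bytes landed in the buffer with ID 'bid' */
	}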

>> +static int io_rw_common_cflags(struct io_kiocb *req)
>> +{
>> +	struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
>> +	int cflags;
>> +
>> +	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>> +	cflags |= IORING_CQE_F_BUFFER;
>> +	req->rw.addr = 0;
>> +	kfree(kbuf);
>> +	return cflags;
>> +}
> 
>>  		if (refcount_dec_and_test(&req->refs) &&
>> @@ -1819,13 +1860,16 @@ static inline void req_set_fail_links(struct io_kiocb *req)
>>  static void io_complete_rw_common(struct kiocb *kiocb, long res)
>>  {
>>  	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
>> +	int cflags = 0;
>>  
>>  	if (kiocb->ki_flags & IOCB_WRITE)
>>  		kiocb_end_write(req);
>>  
>>  	if (res != req->result)
>>  		req_set_fail_links(req);
>> -	io_cqring_add_event(req, res);
>> +	if (req->flags & REQ_F_BUFFER_SELECTED)
>> +		cflags = io_rw_common_cflags(req);
>> +	__io_cqring_add_event(req, res, cflags);
>>  }
> 
> Besides the naming already commented upon by Pavel, I'm also wondering
> if it's the right thing to call this unconditionally from
> io_complete_*rw*_common() - hard to see how this feature would ever be
> used in the write path...

Doesn't really matter, I think; I'd rather have that dead branch for
writes than need a separate handler. I did change the naming, as this
posting is almost two weeks old. I'll fold in the other little bits
from here and post a new series so we're all on the same page.

>> +static struct io_buffer *io_buffer_select(struct io_kiocb *req, int gid,
>> +					  struct io_buffer *kbuf,
>> +					  bool needs_lock)
>> +{
>> +	struct list_head *list;
>> +
>> +	if (req->flags & REQ_F_BUFFER_SELECTED)
>> +		return kbuf;
>> +
>> +	/*
>> +	 * "Normal" inline submissions always hold the uring_lock, since we
>> +	 * grab it from the system call. Same is true for the SQPOLL offload.
>> +	 * The only exception is when we've detached the request and issue it
>> +	 * from an async worker thread, grab the lock for that case.
>> +	 */
>> +	if (needs_lock)
>> +		mutex_lock(&req->ctx->uring_lock);
>> +
>> +	lockdep_assert_held(&req->ctx->uring_lock);
> 
> This comment is in a few places, perhaps there's a way to unify by
> placing the conditional acquisition into a helper?

We could have an io_lock_ring(ctx, force_nonblock) helper and just put
it in there, and ditto for the unlock.
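
Something like this, totally untested:

static void io_lock_ring(struct io_ring_ctx *ctx, bool needs_lock)
{
	/*
	 * Inline submission and SQPOLL already hold uring_lock, only
	 * the async worker path needs to take it.
	 */
	if (needs_lock)
		mutex_lock(&ctx->uring_lock);
	lockdep_assert_held(&ctx->uring_lock);
}

static void io_unlock_ring(struct io_ring_ctx *ctx, bool needs_lock)
{
	if (needs_lock)
		mutex_unlock(&ctx->uring_lock);
}

That would also keep the comment in one spot.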

>> +	list = idr_find(&req->ctx->io_buffer_idr, gid);
>> +	if (list && !list_empty(list)) {
>> +		kbuf = list_first_entry(list, struct io_buffer, list);
>> +		list_del(&kbuf->list);
>> +	} else {
>> +		kbuf = ERR_PTR(-ENOBUFS);
>> +	}
>> +
>> +	if (needs_lock)
>> +		mutex_unlock(&req->ctx->uring_lock);
>> +
>> +	return kbuf;
>> +}
>> +
>>  static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>> -			       struct iovec **iovec, struct iov_iter *iter)
>> +			       struct iovec **iovec, struct iov_iter *iter,
>> +			       bool needs_lock)
>>  {
>>  	void __user *buf = u64_to_user_ptr(req->rw.addr);
>>  	size_t sqe_len = req->rw.len;
>> @@ -2140,12 +2219,30 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>  		return io_import_fixed(req, rw, iter);
>>  	}
>>  
>> -	/* buffer index only valid with fixed read/write */
>> -	if (req->rw.kiocb.private)
>> +	/* buffer index only valid with fixed read/write, or buffer select  */
>> +	if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT))
>>  		return -EINVAL;
>>  
>>  	if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
>>  		ssize_t ret;
>> +
>> +		if (req->flags & REQ_F_BUFFER_SELECT) {
>> +			struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
>> +			int gid;
>> +
>> +			gid = (int) (unsigned long) req->rw.kiocb.private;
>> +			kbuf = io_buffer_select(req, gid, kbuf, needs_lock);
>> +			if (IS_ERR(kbuf)) {
>> +				*iovec = NULL;
>> +				return PTR_ERR(kbuf);
>> +			}
>> +			req->rw.addr = (u64) kbuf;
>> +			if (sqe_len > kbuf->len)
>> +				sqe_len = kbuf->len;
>> +			req->flags |= REQ_F_BUFFER_SELECTED;
>> +			buf = u64_to_user_ptr(kbuf->addr);
>> +		}
> 
> Feels a bit dangerous to have addr sometimes pointing to the user
> specified data, and sometimes to kernel data. Even if indicated by
> REQ_F_BUFFER_SELECTED.

It's not ideal, but it's either that or having struct io_rw blow past
the cacheline, which I don't want. So the tradeoff seemed like the
right one to me. All the initial io_kiocb per-request-type unions are
64 bytes or less.
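
To make the invariant explicit, ->rw.addr only ever holds a kernel
pointer once REQ_F_BUFFER_SELECTED has been set, so consumers end up
looking like this (condensed from the patch):

	if (req->flags & REQ_F_BUFFER_SELECTED) {
		/* selected: addr holds the struct io_buffer */
		struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;

		buf = u64_to_user_ptr(kbuf->addr);
	} else {
		/* not selected: addr is the user-supplied address */
		buf = u64_to_user_ptr(req->rw.addr);
	}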

>> +static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req,
>> +					       int *cflags, bool needs_lock)
>> +{
>> +	struct io_sr_msg *sr = &req->sr_msg;
>> +	struct io_buffer *kbuf;
>> +
>> +	if (!(req->flags & REQ_F_BUFFER_SELECT))
>> +		return NULL;
>> +
>> +	kbuf = io_buffer_select(req, sr->gid, sr->kbuf, needs_lock);
>> +	if (IS_ERR(kbuf))
>> +		return kbuf;
>> +
>> +	sr->kbuf = kbuf;
>> +	if (sr->len > kbuf->len)
>> +		sr->len = kbuf->len;
>> +	req->flags |= REQ_F_BUFFER_SELECTED;
>> +
>> +	*cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>> +	*cflags |= IORING_CQE_F_BUFFER;
>> +	return kbuf;
>> +}
> 
> Could more of this be moved into io_buffer_select? Looks like every
> REQ_F_BUFFER_SELECT supporting op is going to need most of it?

Probably could be; I'll take a look and see if we can move more of that
logic in there.
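
Roughly what I have in mind, untested, with the length clamp and flag
setting folded into the helper so each opcode doesn't repeat it:

static struct io_buffer *io_buffer_select(struct io_kiocb *req, size_t *len,
					  int gid, struct io_buffer *kbuf,
					  bool needs_lock)
{
	struct list_head *list;

	if (req->flags & REQ_F_BUFFER_SELECTED)
		return kbuf;

	if (needs_lock)
		mutex_lock(&req->ctx->uring_lock);

	lockdep_assert_held(&req->ctx->uring_lock);

	list = idr_find(&req->ctx->io_buffer_idr, gid);
	if (list && !list_empty(list)) {
		kbuf = list_first_entry(list, struct io_buffer, list);
		list_del(&kbuf->list);
	} else {
		kbuf = ERR_PTR(-ENOBUFS);
	}

	if (needs_lock)
		mutex_unlock(&req->ctx->uring_lock);

	if (!IS_ERR(kbuf)) {
		/* clamp the request length and mark the buffer selected */
		if (*len > kbuf->len)
			*len = kbuf->len;
		req->flags |= REQ_F_BUFFER_SELECTED;
	}
	return kbuf;
}

That leaves just the cflags encoding in the per-opcode callers.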

>>  static int io_recvmsg_prep(struct io_kiocb *req,
>>  			   const struct io_uring_sqe *sqe)
>>  {
> 
> Looks like this would be unused if !defined(CONFIG_NET)?

Also fixed a week ago or so.
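
For reference, the fix just stubs out the net handlers when CONFIG_NET
isn't set, along these lines:

#if defined(CONFIG_NET)
/* the real io_recvmsg_prep() etc */
#else
static int io_recvmsg_prep(struct io_kiocb *req,
			   const struct io_uring_sqe *sqe)
{
	return -EOPNOTSUPP;
}
#endif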

-- 
Jens Axboe

