Linux-NVME Archive mirror
* [RFC PATCH V2 0/2] *** use rdma device capability to limit queue size  ***
@ 2023-12-19  7:32 Guixin Liu
  2023-12-19  7:32 ` [RFC PATCH V2 1/2] nvmet: rdma: utilize ib_device capability for setting max_queue_size Guixin Liu
  2023-12-19  7:32 ` [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize Guixin Liu
  0 siblings, 2 replies; 11+ messages in thread
From: Guixin Liu @ 2023-12-19  7:32 UTC
  To: hch, sagi, kbusch, kch, axboe; +Cc: linux-nvme

Hi guys,

Currently, the queue size of nvme over rdma is limited to a depth
of 128. We can use the rdma device's capability to set this limit
instead.

Changes from v1 to v2:
- use nvmet_ctrl's port to find the rdma nvmet_rdma_queue instead of
  changing get_max_queue_size's parameter.
- clarify that the queue depth will be capped at 1024 rather than
  growing arbitrarily large.

Guixin Liu (2):
  nvmet: rdma: utilize ib_device capability for setting max_queue_size
  nvme: rdma: use ib_device's max_qp_wr to limit sqsize

 drivers/nvme/host/rdma.c   | 14 ++++++++------
 drivers/nvme/target/rdma.c | 17 ++++++++++++++++-
 include/linux/nvme-rdma.h  |  2 +-
 3 files changed, 25 insertions(+), 8 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH V2 1/2] nvmet: rdma: utilize ib_device capability for setting max_queue_size
  2023-12-19  7:32 [RFC PATCH V2 0/2] *** use rdma device capability to limit queue size *** Guixin Liu
@ 2023-12-19  7:32 ` Guixin Liu
  2023-12-19  7:32 ` [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize Guixin Liu
  1 sibling, 0 replies; 11+ messages in thread
From: Guixin Liu @ 2023-12-19  7:32 UTC
  To: hch, sagi, kbusch, kch, axboe; +Cc: linux-nvme

Report the smaller of 1024 and the queue depth derived from the
ib_device's max_qp_wr as the RDMA max queue size.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 drivers/nvme/target/rdma.c | 17 ++++++++++++++++-
 include/linux/nvme-rdma.h  |  2 ++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 4597bca..d546203 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -2002,7 +2002,22 @@ static u8 nvmet_rdma_get_mdts(const struct nvmet_ctrl *ctrl)
 
 static u16 nvmet_rdma_get_max_queue_size(const struct nvmet_ctrl *ctrl)
 {
-	return NVME_RDMA_MAX_QUEUE_SIZE;
+	struct nvmet_rdma_queue *queue;
+	struct nvmet_port *nport = ctrl->port;
+	int max_qp_wr = 0;
+
+	/* Here nvmet_ctrl's port is already set. */
+	mutex_lock(&nvmet_rdma_queue_mutex);
+	list_for_each_entry(queue, &nvmet_rdma_queue_list, queue_list) {
+		if (queue->port == nport) {
+			max_qp_wr = queue->dev->device->attrs.max_qp_wr;
+			break;
+		}
+	}
+	mutex_unlock(&nvmet_rdma_queue_mutex);
+
+	return (u16)min_t(int, NVMET_QUEUE_SIZE,
+			  max_qp_wr / (NVME_RDMA_SEND_WR_FACTOR + 1));
 }
 
 static const struct nvmet_fabrics_ops nvmet_rdma_ops = {
diff --git a/include/linux/nvme-rdma.h b/include/linux/nvme-rdma.h
index 4dd7e6f..c19858b 100644
--- a/include/linux/nvme-rdma.h
+++ b/include/linux/nvme-rdma.h
@@ -8,6 +8,8 @@
 
 #define NVME_RDMA_MAX_QUEUE_SIZE	128
 
+#define NVME_RDMA_SEND_WR_FACTOR 3  /* MR, SEND, INV */
+
 enum nvme_rdma_cm_fmt {
 	NVME_RDMA_CM_FMT_1_0 = 0x0,
 };
-- 
1.8.3.1




* [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-19  7:32 [RFC PATCH V2 0/2] *** use rdma device capability to limit queue size *** Guixin Liu
  2023-12-19  7:32 ` [RFC PATCH V2 1/2] nvmet: rdma: utilize ib_device capability for setting max_queue_size Guixin Liu
@ 2023-12-19  7:32 ` Guixin Liu
  2023-12-20  9:17   ` Sagi Grimberg
  1 sibling, 1 reply; 11+ messages in thread
From: Guixin Liu @ 2023-12-19  7:32 UTC
  To: hch, sagi, kbusch, kch, axboe; +Cc: linux-nvme

Currently, the host is limited to creating queues with a depth of
128. To enable larger queue sizes, constrain the sqsize based on
the ib_device's max_qp_wr capability.
In addition, the queue size is restricted to between 16 and 1024 in
nvmf_parse_options(), so the final queue depth will not be bigger
than 1024.

Also remove the now-unused NVME_RDMA_MAX_QUEUE_SIZE macro.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 drivers/nvme/host/rdma.c  | 14 ++++++++------
 include/linux/nvme-rdma.h |  2 --
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 81e2621..982f3e4 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -489,8 +489,7 @@ static int nvme_rdma_create_cq(struct ib_device *ibdev,
 static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 {
 	struct ib_device *ibdev;
-	const int send_wr_factor = 3;			/* MR, SEND, INV */
-	const int cq_factor = send_wr_factor + 1;	/* + RECV */
+	const int cq_factor = NVME_RDMA_SEND_WR_FACTOR + 1;	/* + RECV */
 	int ret, pages_per_mr;
 
 	queue->device = nvme_rdma_find_get_device(queue->cm_id);
@@ -508,7 +507,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	if (ret)
 		goto out_put_dev;
 
-	ret = nvme_rdma_create_qp(queue, send_wr_factor);
+	ret = nvme_rdma_create_qp(queue, NVME_RDMA_SEND_WR_FACTOR);
 	if (ret)
 		goto out_destroy_ib_cq;
 
@@ -1006,6 +1005,7 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
 {
 	int ret;
 	bool changed;
+	int ib_max_qsize;
 
 	ret = nvme_rdma_configure_admin_queue(ctrl, new);
 	if (ret)
@@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
 			ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
 	}
 
-	if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
+	ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
+			(NVME_RDMA_SEND_WR_FACTOR + 1);
+	if (ctrl->ctrl.sqsize + 1 > ib_max_qsize) {
 		dev_warn(ctrl->ctrl.device,
 			"ctrl sqsize %u > max queue size %u, clamping down\n",
-			ctrl->ctrl.sqsize + 1, NVME_RDMA_MAX_QUEUE_SIZE);
-		ctrl->ctrl.sqsize = NVME_RDMA_MAX_QUEUE_SIZE - 1;
+			ctrl->ctrl.sqsize + 1, ib_max_qsize);
+		ctrl->ctrl.sqsize = ib_max_qsize - 1;
 	}
 
 	if (ctrl->ctrl.sqsize + 1 > ctrl->ctrl.maxcmd) {
diff --git a/include/linux/nvme-rdma.h b/include/linux/nvme-rdma.h
index c19858b..67ee770 100644
--- a/include/linux/nvme-rdma.h
+++ b/include/linux/nvme-rdma.h
@@ -6,8 +6,6 @@
 #ifndef _LINUX_NVME_RDMA_H
 #define _LINUX_NVME_RDMA_H
 
-#define NVME_RDMA_MAX_QUEUE_SIZE	128
-
 #define NVME_RDMA_SEND_WR_FACTOR 3  /* MR, SEND, INV */
 
 enum nvme_rdma_cm_fmt {
-- 
1.8.3.1




* Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-19  7:32 ` [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize Guixin Liu
@ 2023-12-20  9:17   ` Sagi Grimberg
  2023-12-20 10:52     ` Max Gurtovoy
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2023-12-20  9:17 UTC
  To: Guixin Liu, hch, kbusch, kch, axboe; +Cc: linux-nvme



On 12/19/23 09:32, Guixin Liu wrote:
> Currently, the host is limited to creating queues with a depth of
> 128. To enable larger queue sizes, constrain the sqsize based on
> the ib_device's max_qp_wr capability.
> In addition, the queue size is restricted to between 16 and 1024 in
> nvmf_parse_options(), so the final queue depth will not be bigger
> than 1024.
> 
> Also remove the now-unused NVME_RDMA_MAX_QUEUE_SIZE macro.
> 
> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
> ---
>   drivers/nvme/host/rdma.c  | 14 ++++++++------
>   include/linux/nvme-rdma.h |  2 --
>   2 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 81e2621..982f3e4 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -489,8 +489,7 @@ static int nvme_rdma_create_cq(struct ib_device *ibdev,
>   static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
>   {
>   	struct ib_device *ibdev;
> -	const int send_wr_factor = 3;			/* MR, SEND, INV */
> -	const int cq_factor = send_wr_factor + 1;	/* + RECV */
> +	const int cq_factor = NVME_RDMA_SEND_WR_FACTOR + 1;	/* + RECV */
>   	int ret, pages_per_mr;
>   
>   	queue->device = nvme_rdma_find_get_device(queue->cm_id);
> @@ -508,7 +507,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
>   	if (ret)
>   		goto out_put_dev;
>   
> -	ret = nvme_rdma_create_qp(queue, send_wr_factor);
> +	ret = nvme_rdma_create_qp(queue, NVME_RDMA_SEND_WR_FACTOR);
>   	if (ret)
>   		goto out_destroy_ib_cq;
>   
> @@ -1006,6 +1005,7 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
>   {
>   	int ret;
>   	bool changed;
> +	int ib_max_qsize;
>   
>   	ret = nvme_rdma_configure_admin_queue(ctrl, new);
>   	if (ret)
> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
>   			ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>   	}
>   
> -	if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
> +	ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
> +			(NVME_RDMA_SEND_WR_FACTOR + 1);

rdma_dev_max_qsize is a better name.

Also, you can drop the RFC for the next submission.



* Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-20  9:17   ` Sagi Grimberg
@ 2023-12-20 10:52     ` Max Gurtovoy
  2023-12-20 19:27       ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Max Gurtovoy @ 2023-12-20 10:52 UTC
  To: Sagi Grimberg, Guixin Liu, hch, kbusch, kch, axboe; +Cc: linux-nvme


Hi Guixin Liu,

On 20/12/2023 11:17, Sagi Grimberg wrote:
> 
> 
> On 12/19/23 09:32, Guixin Liu wrote:
>> Currently, the host is limited to creating queues with a depth of
>> 128. To enable larger queue sizes, constrain the sqsize based on
>> the ib_device's max_qp_wr capability.
>> In addition, the queue size is restricted to between 16 and 1024 in
>> nvmf_parse_options(), so the final queue depth will not be bigger
>> than 1024.

Can you please explain why this patch series is needed?
What is the added value you see with it?

>>
>> Also remove the now-unused NVME_RDMA_MAX_QUEUE_SIZE macro.
>>
>> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
>> ---
>>   drivers/nvme/host/rdma.c  | 14 ++++++++------
>>   include/linux/nvme-rdma.h |  2 --
>>   2 files changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 81e2621..982f3e4 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -489,8 +489,7 @@ static int nvme_rdma_create_cq(struct ib_device 
>> *ibdev,
>>   static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
>>   {
>>       struct ib_device *ibdev;
>> -    const int send_wr_factor = 3;            /* MR, SEND, INV */
>> -    const int cq_factor = send_wr_factor + 1;    /* + RECV */
>> +    const int cq_factor = NVME_RDMA_SEND_WR_FACTOR + 1;    /* + RECV */
>>       int ret, pages_per_mr;
>>       queue->device = nvme_rdma_find_get_device(queue->cm_id);
>> @@ -508,7 +507,7 @@ static int nvme_rdma_create_queue_ib(struct 
>> nvme_rdma_queue *queue)
>>       if (ret)
>>           goto out_put_dev;
>> -    ret = nvme_rdma_create_qp(queue, send_wr_factor);
>> +    ret = nvme_rdma_create_qp(queue, NVME_RDMA_SEND_WR_FACTOR);
>>       if (ret)
>>           goto out_destroy_ib_cq;
>> @@ -1006,6 +1005,7 @@ static int nvme_rdma_setup_ctrl(struct 
>> nvme_rdma_ctrl *ctrl, bool new)
>>   {
>>       int ret;
>>       bool changed;
>> +    int ib_max_qsize;
>>       ret = nvme_rdma_configure_admin_queue(ctrl, new);
>>       if (ret)
>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>> nvme_rdma_ctrl *ctrl, bool new)
>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>       }
>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
> 
> rdma_dev_max_qsize is a better name.
> 
> Also, you can drop the RFC for the next submission.
> 

Sagi,
I don't feel comfortable with these patches.
First I would like to understand the need for it.
Second, the QP WR can be constructed from one or more WQEs and the WQEs 
can be constructed from one or more WQEBBs. The max_qp_wr doesn't take 
it into account.



* Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-20 10:52     ` Max Gurtovoy
@ 2023-12-20 19:27       ` Sagi Grimberg
  2023-12-22  6:58         ` Guixin Liu
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2023-12-20 19:27 UTC
  To: Max Gurtovoy, Guixin Liu, hch, kbusch, kch, axboe; +Cc: linux-nvme


>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>> nvme_rdma_ctrl *ctrl, bool new)
>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>       }
>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>
>> rdma_dev_max_qsize is a better name.
>>
>> Also, you can drop the RFC for the next submission.
>>
> 
> Sagi,
> I don't feel comfortable with these patches.

Well, good that you're speaking up then ;)

> First I would like to understand the need for it.

I assumed that he stumbled on a device that did not support the
existing max of 128 nvme commands (which is 384 rdma wrs for the qp).

> Second, the QP WR can be constructed from one or more WQEs and the WQEs 
> can be constructed from one or more WQEBBs. The max_qp_wr doesn't take 
> it into account.

Well, it is not taken into account now either with the existing magic
limit in nvmet. The rdma limits reporting mechanism was and still is
unusable.

I would expect a device that has different size for different work
items to report max_qp_wr accounting for the largest work element that
the device supports, so it is universally correct.

The fact that max_qp_wr means the maximum number of slots in a qp and
at the same time different work requests can arbitrarily use any number
of slots without anyone ever knowing, makes it pretty much impossible to
use reliably.

Maybe rdma device attributes need a new attribute called
universal_max_qp_wr that is going to actually be reliable and not
guess-work?



* Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-20 19:27       ` Sagi Grimberg
@ 2023-12-22  6:58         ` Guixin Liu
  2023-12-24  1:37           ` Max Gurtovoy
  0 siblings, 1 reply; 11+ messages in thread
From: Guixin Liu @ 2023-12-22  6:58 UTC
  To: Sagi Grimberg, Max Gurtovoy, hch, kbusch, kch, axboe; +Cc: linux-nvme


On 2023/12/21 03:27, Sagi Grimberg wrote:
>
>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>>> nvme_rdma_ctrl *ctrl, bool new)
>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>       }
>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>>
>>> rdma_dev_max_qsize is a better name.
>>>
>>> Also, you can drop the RFC for the next submission.
>>>
>>
>> Sagi,
>> I don't feel comfortable with these patches.
>
> Well, good that you're speaking up then ;)
>
>> First I would like to understand the need for it.
>
> I assumed that he stumbled on a device that did not support the
> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>
The situation is that I need a queue depth greater than 128.
>> Second, the QP WR can be constructed from one or more WQEs and the 
>> WQEs can be constructed from one or more WQEBBs. The max_qp_wr 
>> doesn't take it into account.
>
> Well, it is not taken into account now either with the existing magic
> limit in nvmet. The rdma limits reporting mechanism was and still is
> unusable.
>
> I would expect a device that has different size for different work
> items to report max_qp_wr accounting for the largest work element that
> the device supports, so it is universally correct.
>
> The fact that max_qp_wr means the maximum number of slots in a qp and
> at the same time different work requests can arbitrarily use any number
> of slots without anyone ever knowing, makes it pretty much impossible to
> use reliably.
>
> Maybe rdma device attributes need a new attribute called
> universal_max_qp_wr that is going to actually be reliable and not
> guess-work?

I see, the max_qp_wr is not as reliable as I imagined. Is there
another way to get a queue depth greater than 128 instead of changing
NVME_RDMA_MAX_QUEUE_SIZE?




* Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-22  6:58         ` Guixin Liu
@ 2023-12-24  1:37           ` Max Gurtovoy
  2023-12-25  8:40             ` Guixin Liu
  0 siblings, 1 reply; 11+ messages in thread
From: Max Gurtovoy @ 2023-12-24  1:37 UTC
  To: Guixin Liu, Sagi Grimberg, hch, kbusch, kch, axboe; +Cc: linux-nvme



On 22/12/2023 8:58, Guixin Liu wrote:
> 
> On 2023/12/21 03:27, Sagi Grimberg wrote:
>>
>>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>>>> nvme_rdma_ctrl *ctrl, bool new)
>>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>>       }
>>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>>>
>>>> rdma_dev_max_qsize is a better name.
>>>>
>>>> Also, you can drop the RFC for the next submission.
>>>>
>>>
>>> Sagi,
>>> I don't feel comfortable with these patches.
>>
>> Well, good that you're speaking up then ;)
>>
>>> First I would like to understand the need for it.
>>
>> I assumed that he stumbled on a device that did not support the
>> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>>
> The situation is that I need a queue depth greater than 128.
>>> Second, the QP WR can be constructed from one or more WQEs and the 
>>> WQEs can be constructed from one or more WQEBBs. The max_qp_wr 
>>> doesn't take it into account.
>>
>> Well, it is not taken into account now either with the existing magic
>> limit in nvmet. The rdma limits reporting mechanism was and still is
>> unusable.
>>
>> I would expect a device that has different size for different work
>> items to report max_qp_wr accounting for the largest work element that
>> the device supports, so it is universally correct.
>>
>> The fact that max_qp_wr means the maximum number of slots in a qp and
>> at the same time different work requests can arbitrarily use any number
>> of slots without anyone ever knowing, makes it pretty much impossible to
>> use reliably.
>>
>> Maybe rdma device attributes need a new attribute called
>> universal_max_qp_wr that is going to actually be reliable and not
>> guess-work?
> 
> I see, the max_qp_wr is not as reliable as I imagined. Is there
> another way to get a queue depth greater than 128 instead of
> changing NVME_RDMA_MAX_QUEUE_SIZE?
> 

When I added this limit to RDMA transports it was to avoid a situation
where creating a QP fails when a large queue is requested.

I chose 128 since it was supported by all the RDMA adapters I've
tested in my lab (mostly Mellanox adapters).
At this queue depth we found that the performance is good enough and
will not improve if we increase the depth.

Are you saying that you have a device that can provide better 
performance with qdepth > 128 ?
What is the tested qdepth and what are the numbers you see with this 
qdepth ?



* Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-24  1:37           ` Max Gurtovoy
@ 2023-12-25  8:40             ` Guixin Liu
  2023-12-25  8:59               ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Guixin Liu @ 2023-12-25  8:40 UTC
  To: Max Gurtovoy, Sagi Grimberg, hch, kbusch, kch, axboe; +Cc: linux-nvme


On 2023/12/24 09:37, Max Gurtovoy wrote:
>
>
> On 22/12/2023 8:58, Guixin Liu wrote:
>>
>> On 2023/12/21 03:27, Sagi Grimberg wrote:
>>>
>>>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>>>>> nvme_rdma_ctrl *ctrl, bool new)
>>>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>>>       }
>>>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>>>>
>>>>> rdma_dev_max_qsize is a better name.
>>>>>
>>>>> Also, you can drop the RFC for the next submission.
>>>>>
>>>>
>>>> Sagi,
>>>> I don't feel comfortable with these patches.
>>>
>>> Well, good that you're speaking up then ;)
>>>
>>>> First I would like to understand the need for it.
>>>
>>> I assumed that he stumbled on a device that did not support the
>>> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>>>
>> The situation is that I need a queue depth greater than 128.
>>>> Second, the QP WR can be constructed from one or more WQEs and the 
>>>> WQEs can be constructed from one or more WQEBBs. The max_qp_wr 
>>>> doesn't take it into account.
>>>
>>> Well, it is not taken into account now either with the existing magic
>>> limit in nvmet. The rdma limits reporting mechanism was and still is
>>> unusable.
>>>
>>> I would expect a device that has different size for different work
>>> items to report max_qp_wr accounting for the largest work element that
>>> the device supports, so it is universally correct.
>>>
>>> The fact that max_qp_wr means the maximum number of slots in a qp and
>>> at the same time different work requests can arbitrarily use any number
>>> of slots without anyone ever knowing, makes it pretty much 
>>> impossible to
>>> use reliably.
>>>
>>> Maybe rdma device attributes need a new attribute called
>>> universal_max_qp_wr that is going to actually be reliable and not
>>> guess-work?
>>
>> I see, the max_qp_wr is not as reliable as I imagined. Is there
>> another way to get a queue depth greater than 128 instead of
>> changing NVME_RDMA_MAX_QUEUE_SIZE?
>>
>
> When I added this limit to RDMA transports it was to avoid a situation
> where creating a QP fails when a large queue is requested.
>
> I chose 128 since it was supported by all the RDMA adapters I've
> tested in my lab (mostly Mellanox adapters).
> At this queue depth we found that the performance is good enough and
> will not improve if we increase the depth.
>
> Are you saying that you have a device that can provide better 
> performance with qdepth > 128 ?
> What is the tested qdepth and what are the numbers you see with this 
> qdepth ?

Yeah, you are right, the improvement is small (about 1%~2%); I did
this only for better benchmarks.

I still insist that using the capabilities of the RDMA device to
determine the queue size is a better choice, but for now I will change
NVME_RDMA_MAX_QUEUE_SIZE to 256 for bidding.

Best regards,

Guixin Liu





* Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-25  8:40             ` Guixin Liu
@ 2023-12-25  8:59               ` Sagi Grimberg
  2023-12-25 12:36                 ` Max Gurtovoy
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2023-12-25  8:59 UTC
  To: Guixin Liu, Max Gurtovoy, hch, kbusch, kch, axboe; +Cc: linux-nvme


>>>>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>>>>>> nvme_rdma_ctrl *ctrl, bool new)
>>>>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>>>>       }
>>>>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>>>>>
>>>>>> rdma_dev_max_qsize is a better name.
>>>>>>
>>>>>> Also, you can drop the RFC for the next submission.
>>>>>>
>>>>>
>>>>> Sagi,
>>>>> I don't feel comfortable with these patches.
>>>>
>>>> Well, good that you're speaking up then ;)
>>>>
>>>>> First I would like to understand the need for it.
>>>>
>>>> I assumed that he stumbled on a device that did not support the
>>>> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>>>>
>>> The situation is that I need a queue depth greater than 128.
>>>>> Second, the QP WR can be constructed from one or more WQEs and the 
>>>>> WQEs can be constructed from one or more WQEBBs. The max_qp_wr 
>>>>> doesn't take it into account.
>>>>
>>>> Well, it is not taken into account now either with the existing magic
>>>> limit in nvmet. The rdma limits reporting mechanism was and still is
>>>> unusable.
>>>>
>>>> I would expect a device that has different size for different work
>>>> items to report max_qp_wr accounting for the largest work element that
>>>> the device supports, so it is universally correct.
>>>>
>>>> The fact that max_qp_wr means the maximum number of slots in a qp and
>>>> at the same time different work requests can arbitrarily use any number
>>>> of slots without anyone ever knowing, makes it pretty much 
>>>> impossible to
>>>> use reliably.
>>>>
>>>> Maybe rdma device attributes need a new attribute called
>>>> universal_max_qp_wr that is going to actually be reliable and not
>>>> guess-work?
>>>
>>> I see, the max_qp_wr is not as reliable as I imagined. Is there
>>> another way to get a queue depth greater than 128 instead of
>>> changing NVME_RDMA_MAX_QUEUE_SIZE?
>>>
>>
>> When I added this limit to RDMA transports it was to avoid a
>> situation where creating a QP fails when a large queue is requested.
>>
>> I chose 128 since it was supported by all the RDMA adapters I've
>> tested in my lab (mostly Mellanox adapters).
>> At this queue depth we found that the performance is good enough and
>> will not improve if we increase the depth.
>>
>> Are you saying that you have a device that can provide better 
>> performance with qdepth > 128 ?
>> What is the tested qdepth and what are the numbers you see with this 
>> qdepth ?
> 
> Yeah, you are right, the improvement is small (about 1%~2%); I did
> this only for better benchmarks.

Well, it doesn't come for free, you are essentially doubling the queue
depth. I'm also assuming that you tested a single initiator and a
single queue?

> I still insist that using the capabilities of the RDMA device to
> determine the queue size is a better choice, but for now I will
> change NVME_RDMA_MAX_QUEUE_SIZE to 256 for bidding.

Still doesn't change the fact that it's pure guess-work whether it is
supported by the device or not.

Are you even able to create that queue depth with DIF workloads?

Max, what is the maximum effective depth with DIF enabled?



* Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
  2023-12-25  8:59               ` Sagi Grimberg
@ 2023-12-25 12:36                 ` Max Gurtovoy
  0 siblings, 0 replies; 11+ messages in thread
From: Max Gurtovoy @ 2023-12-25 12:36 UTC
  To: Sagi Grimberg, Guixin Liu, hch, kbusch, kch, axboe; +Cc: linux-nvme



On 25/12/2023 10:59, Sagi Grimberg wrote:
> 
>>>>>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>>>>>>> nvme_rdma_ctrl *ctrl, bool new)
>>>>>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>>>>>       }
>>>>>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>>>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>>>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>>>>>>
>>>>>>> rdma_dev_max_qsize is a better name.
>>>>>>>
>>>>>>> Also, you can drop the RFC for the next submission.
>>>>>>>
>>>>>>
>>>>>> Sagi,
>>>>>> I don't feel comfortable with these patches.
>>>>>
>>>>> Well, good that you're speaking up then ;)
>>>>>
>>>>>> First I would like to understand the need for it.
>>>>>
>>>>> I assumed that he stumbled on a device that did not support the
>>>>> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>>>>>
>>>> The situation is that I need a queue depth greater than 128.
>>>>>> Second, the QP WR can be constructed from one or more WQEs and the 
>>>>>> WQEs can be constructed from one or more WQEBBs. The max_qp_wr 
>>>>>> doesn't take it into account.
>>>>>
>>>>> Well, it is not taken into account now either with the existing magic
>>>>> limit in nvmet. The rdma limits reporting mechanism was and still is
>>>>> unusable.
>>>>>
>>>>> I would expect a device that has different size for different work
>>>>> items to report max_qp_wr accounting for the largest work element that
>>>>> the device supports, so it is universally correct.
>>>>>
>>>>> The fact that max_qp_wr means the maximum number of slots in a qp and
>>>>> at the same time different work requests can arbitrarily use any 
>>>>> number
>>>>> of slots without anyone ever knowing, makes it pretty much 
>>>>> impossible to
>>>>> use reliably.
>>>>>
>>>>> Maybe rdma device attributes need a new attribute called
>>>>> universal_max_qp_wr that is going to actually be reliable and not
>>>>> guess-work?
>>>>
>>>> I see, the max_qp_wr is not as reliable as I imagined. Is there
>>>> another way to get a queue depth greater than 128 instead of
>>>> changing NVME_RDMA_MAX_QUEUE_SIZE?
>>>>
>>>
>>> When I added this limit to RDMA transports it was to avoid a
>>> situation where creating a QP fails when a large queue is
>>> requested.
>>>
>>> I chose 128 since it was supported by all the RDMA adapters I've
>>> tested in my lab (mostly Mellanox adapters).
>>> At this queue depth we found that the performance is good enough and
>>> will not improve if we increase the depth.
>>>
>>> Are you saying that you have a device that can provide better 
>>> performance with qdepth > 128 ?
>>> What is the tested qdepth and what are the numbers you see with this 
>>> qdepth ?
>>
>> Yeah, you are right, the improvement is small (about 1%~2%); I did
>> this only for better benchmarks.
> 
> Well, it doesn't come for free, you are essentially doubling the queue
> depth. I'm also assuming that you tested a single initiator and a
> single queue?
> 
>> I still insist that using the capabilities of the RDMA device to
>> determine the queue size is a better choice, but for now I will
>> change NVME_RDMA_MAX_QUEUE_SIZE to 256 for bidding.
> 
> Still doesn't change the fact that it's pure guess-work whether it is
> supported by the device or not.
> 
> Are you even able to create that queue depth with DIF workloads?
> 
> Max, what is the maximum effective depth with DIF enabled?

I'll need to check it.

I'll prepare some patches to allow an RDMA queue_size of 256 for
non-pi controllers anyway.
I would also like to add another configfs entry to determine the max
queue size of a target port.
I hope to merge it in the upcoming merge window.


