From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755010AbbFPPvy (ORCPT ); Tue, 16 Jun 2015 11:51:54 -0400 Received: from mail-wi0-f173.google.com ([209.85.212.173]:32782 "EHLO mail-wi0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751994AbbFPPvp (ORCPT ); Tue, 16 Jun 2015 11:51:45 -0400 MIME-Version: 1.0 In-Reply-To: <55802F46.7020804@unitedstack.com> References: <557EB47F.6090708@unitedstack.com> <557ED1D4.20605@unitedstack.com> <557F97CB.6070608@unitedstack.com> <55800F24.6060100@unitedstack.com> <55802F46.7020804@unitedstack.com> Date: Tue, 16 Jun 2015 18:51:43 +0300 Message-ID: Subject: Re: [PATCH RFC] storage:rbd: make the size of request is equal to the, size of the object From: Ilya Dryomov To: juncheng bai Cc: idryomov@redhat.com, Alex Elder , Josh Durgin , Guangliang Zhao , jeff@garzik.org, yehuda@hq.newdream.net, Sage Weil , elder@inktank.com, "linux-kernel@vger.kernel.org" , Ceph Development Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 16, 2015 at 5:14 PM, juncheng bai wrote: > > > On 2015/6/16 21:30, Ilya Dryomov wrote: >> >> On Tue, Jun 16, 2015 at 2:57 PM, juncheng bai >> wrote: >>> >>> >>> >>> On 2015/6/16 16:37, Ilya Dryomov wrote: >>>> >>>> >>>> On Tue, Jun 16, 2015 at 6:28 AM, juncheng bai >>>> wrote: >>>>> >>>>> >>>>> >>>>> >>>>> On 2015/6/15 22:27, Ilya Dryomov wrote: >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Jun 15, 2015 at 4:23 PM, juncheng bai >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 2015/6/15 21:03, Ilya Dryomov wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Jun 15, 2015 at 2:18 PM, juncheng bai >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> From 6213215bd19926d1063d4e01a248107dab8a899b Mon Sep 17 >>>>>>>>> 00:00:00 >>>>>>>>> 2001 >>>>>>>>> From: juncheng bai >>>>>>>>> Date: Mon, 15 Jun 2015 18:34:00 +0800 >>>>>>>>> Subject: [PATCH] storage:rbd: make the size of request is equal to >>>>>>>>> the >>>>>>>>> size of the object >>>>>>>>> >>>>>>>>> ensures that the merged size of request can achieve the size of >>>>>>>>> the object. >>>>>>>>> when merge a bio to request or merge a request to request, the >>>>>>>>> sum of the segment number of the current request and the segment >>>>>>>>> number of the bio is not greater than the max segments of the >>>>>>>>> request, >>>>>>>>> so the max size of request is 512k if the max segments of request >>>>>>>>> is >>>>>>>>> BLK_MAX_SEGMENTS. >>>>>>>>> >>>>>>>>> Signed-off-by: juncheng bai >>>>>>>>> --- >>>>>>>>> drivers/block/rbd.c | 2 ++ >>>>>>>>> 1 file changed, 2 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >>>>>>>>> index 0a54c58..dec6045 100644 >>>>>>>>> --- a/drivers/block/rbd.c >>>>>>>>> +++ b/drivers/block/rbd.c >>>>>>>>> @@ -3757,6 +3757,8 @@ static int rbd_init_disk(struct rbd_device >>>>>>>>> *rbd_dev) >>>>>>>>> segment_size = rbd_obj_bytes(&rbd_dev->header); >>>>>>>>> blk_queue_max_hw_sectors(q, segment_size / >>>>>>>>> SECTOR_SIZE); >>>>>>>>> blk_queue_max_segment_size(q, segment_size); >>>>>>>>> + if (segment_size > BLK_MAX_SEGMENTS * PAGE_SIZE) >>>>>>>>> + blk_queue_max_segments(q, segment_size / >>>>>>>>> PAGE_SIZE); >>>>>>>>> blk_queue_io_min(q, segment_size); >>>>>>>>> blk_queue_io_opt(q, segment_size); >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I made a similar patch on Friday, investigating blk-mq plugging >>>>>>>> issue >>>>>>>> reported by Nick. My patch sets it to BIO_MAX_PAGES unconditionally >>>>>>>> - >>>>>>>> AFAIU there is no point in setting to anything bigger since the bios >>>>>>>> will be clipped to that number of vecs. Given that BIO_MAX_PAGES is >>>>>>>> 256, this gives is 1M direct I/Os. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi. For signal bio, the max number of bio_vec is BIO_MAX_PAGES, but a >>>>>>> request can be merged from multiple bios. We can see the below >>>>>>> function: >>>>>>> ll_back_merge_fn, ll_front_merge_fn and etc. >>>>>>> And I test in kernel 3.18 use this patch, and do: >>>>>>> echo 4096 > /sys/block/rbd0/queue/max_sectors_kb >>>>>>> We use systemtap to trace the request size, It is upto 4M. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Kernel 3.18 is pre rbd blk-mq transition, which happened in 4.0. You >>>>>> should test whatever patches you have with at least 4.0. >>>>>> >>>>>> Putting that aside, I must be missing something. You'll get 4M >>>>>> requests on 3.18 both with your patch and without it, the only >>>>>> difference would be the size of bios being merged - 512k vs 1M. Can >>>>>> you describe your test workload and provide before and after traces? >>>>>> >>>>> Hi. I update kernel version to 4.0.5. The test information as shown >>>>> below: >>>>> The base information: >>>>> 03:28:13-root@server-186:~$uname -r >>>>> 4.0.5 >>>>> >>>>> My simple systemtap script: >>>>> probe module("rbd").function("rbd_img_request_create") >>>>> { >>>>> printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3)); >>>>> } >>>>> >>>>> I use dd to execute the test case: >>>>> dd if=/dev/zero of=/dev/rbd0 bs=4M count=1 oflag=direct >>>>> >>>>> Case one: Without patch >>>>> 03:30:23-root@server-186:~$cat /sys/block/rbd0/queue/max_sectors_kb >>>>> 4096 >>>>> 03:30:35-root@server-186:~$cat /sys/block/rbd0/queue/max_segments >>>>> 128 >>>>> >>>>> The output of systemtap for nornal data: >>>>> offset:0 length:524288 >>>>> offset:524288 length:524288 >>>>> offset:1048576 length:524288 >>>>> offset:1572864 length:524288 >>>>> offset:2097152 length:524288 >>>>> offset:2621440 length:524288 >>>>> offset:3145728 length:524288 >>>>> offset:3670016 length:524288 >>>>> >>>>> Case two:With patch >>>>> cat /sys/block/rbd0/queue/max_sectors_kb >>>>> 4096 >>>>> 03:49:14-root@server-186:linux-4.0.5$cat >>>>> /sys/block/rbd0/queue/max_segments >>>>> 1024 >>>>> The output of systemtap for nornal data: >>>>> offset:0 length:1048576 >>>>> offset:1048576 length:1048576 >>>>> offset:2097152 length:1048576 >>>>> offset:3145728 length:1048576 >>>>> >>>>> According to the test, you are right. >>>>> Because the blk-mq doesn't use any scheduling policy. >>>>> 03:52:13-root@server-186:linux-4.0.5$cat >>>>> /sys/block/rbd0/queue/scheduler >>>>> none >>>>> >>>>> In previous versions of the kernel 4.0, the rbd use the defualt >>>>> scheduler:cfq >>>>> >>>>> So, I think that the blk-mq need to do more? >>>> >>>> >>>> >>>> There is no scheduler support in blk-mq as of now but your numbers >>>> don't have anything to do with that. The current behaviour is a result >>>> of a bug in blk-mq. It's fixed by [1], if you apply it you should see >>>> 4M requests with your stap script. >>>> >>>> [1] http://article.gmane.org/gmane.linux.kernel/1941750 >>>> >>> Hi. >>> First, Let's look at the result in the kernel version 3.18 >>> The function blk_limits_max_hw_sectors different implemention between >>> 3.18 >>> and 4.0+. We need do: >>> echo 4094 >/sys/block/rbd0/queue/max_sectors_kb >>> >>> The rbd device information: >>> 11:13:18-root@server-186:~$cat /sys/block/rbd0/queue/max_sectors_kb >>> 4094 >>> 11:15:28-root@server-186:~$cat /sys/block/rbd0/queue/max_segments >>> 1024 >>> >>> The test command: >>> dd if=/dev/zero of=/dev/rbd0 bs=4M count=1 >>> >>> The simple stap script: >>> probe module("rbd").function("rbd_img_request_create") >>> { >>> printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3)); >>> } >>> >>> The output from stap: >>> offset:0 length:4190208 >>> offset:21474770944 length:4096 >>> >>> Second, thanks for your patch [1]. >>> I use the patch [1], and recompile the kernel. >>> The test information as shown below: >>> 12:26:12-root@server-186:$cat /sys/block/rbd0/queue/max_segments >>> 1024 >>> 12:26:23-root@server-186:$cat /sys/block/rbd0/queue/max_sectors_kb >>> 4096 >>> >>> The test command: >>> dd if=/dev/zero of=/dev/rbd0 bs=4M count=2 oflag=direct >>> >>> The simple systemtap script: >>> probe module("rbd").function("rbd_img_request_create") >>> { >>> printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3)); >>> } >>> >>> The output of systemtap for nornal data: >>> offset:0 length:4194304 >>> offset:4194304 length:4194304 >>> offset:21474770944 length:4096 >> >> >> Sorry, I fail to see the purpose of the above tests. The test commands >> differ, the kernels differ and it looks like you had your patch applied >> for both tests. What I'm trying to get you to do is to show me some >> data that will back your claim (which your patch is based on): >> >>> >>> So, I think that the max_segments of request_limits should be divide the >>> object size by PAGE_SIZE. >> >> >> For that you need to use the same kernel and run the same workload. >> The only difference should be whether your patch is applied or not. >> I still think that setting rbd max_segments to anything above >> BIO_MAX_PAGES is bogus, but I'd be happy to be shown wrong on that >> since that would mean better performance, at least in some >> workloads. >> > Hi. > For cloned image, it will avoid doing copyup if the request size is > equal to the object size, I think that it is the key effect of this > patch. > The big request would result in overtime if the ceph backend is busy > or the network bandwidth is too low. You are right, but then again: we get rbd object size sized requests even with the default max_segments. This is true for both < 4.0 and >= 4.0 kernels (with the plugging fix applied). > I suggest that add a module parameter to control the value which > decided by the user settings. A module parameter for what exactly? Thanks, Ilya