From mboxrd@z Thu Jan  1 00:00:00 1970
From: Felipe Franciosi <felipe@nutanix.com>
To: Maxim Levitsky
Cc: Keith Busch, Stefan Hajnoczi, Fam Zheng, kvm@vger.kernel.org,
 Wolfram Sang, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
 Kirti Wankhede, Mauro Carvalho Chehab, Paul E. McKenney, Christoph Hellwig,
 Sagi Grimberg, "Harris, James R", Liang Cunming, Jens Axboe,
 Alex Williamson, Thanos Makatos, John Ferlan, Liu Changpeng,
 Greg Kroah-Hartman, Nicolas Ferre, Paolo Bonzini, Amnon Ilan,
 David S. Miller
Subject: Re:
Date: Fri, 22 Mar 2019 07:54:50 +0000
Message-ID: <0E8918CB-F679-4A5C-92AD-239E9CEC260C@nutanix.com>
In-Reply-To: <8698ad583b1cfe86afc3d5440be630fc3e8e0680.camel@redhat.com>
References: <20190319144116.400-1-mlevitsk@redhat.com>
 <488768D7-1396-4DD1-A648-C86E5CF7DB2F@nutanix.com>
 <42f444d22363bc747f4ad75e9f0c27b40a810631.camel@redhat.com>
 <20190321161239.GH31434@stefanha-x1.localdomain>
 <20190321162140.GA29342@localhost.localdomain>
 <8698ad583b1cfe86afc3d5440be630fc3e8e0680.camel@redhat.com>

> On Mar 21, 2019, at 5:04 PM, Maxim Levitsky wrote:
>
> On Thu, 2019-03-21 at 16:41 +0000, Felipe Franciosi wrote:
>>> On Mar 21, 2019, at 4:21 PM, Keith Busch wrote:
>>>
>>> On Thu, Mar 21, 2019 at 04:12:39PM +0000, Stefan Hajnoczi wrote:
>>>> mdev-nvme seems like a duplication of SPDK. The performance is not
>>>> better and the features are more limited, so why focus on this approach?
>>>>
>>>> One argument might be that the kernel NVMe subsystem wants to offer this
>>>> functionality and loading the kernel module is more convenient than
>>>> managing SPDK to some users.
>>>>
>>>> Thoughts?
>>>
>>> Doesn't SPDK bind a controller to a single process? mdev binds to
>>> namespaces (or their partitions), so you could have many mdev's assigned
>>> to many VMs accessing a single controller.
>>
>> Yes, it binds to a single process which can drive the datapath of multiple
>> virtual controllers for multiple VMs (similar to what you described for mdev).
>> You can therefore efficiently poll multiple VM submission queues (and multiple
>> device completion queues) from a single physical CPU.
>>
>> The same could be done in the kernel, but the code gets complicated as you add
>> more functionality to it. As this is a direct interface with an untrusted
>> front-end (the guest), it's also arguably safer to do in userspace.
>>
>> Worth noting: you can eventually have a single physical core polling all sorts
>> of virtual devices (eg. virtual storage or network controllers) very
>> efficiently. And this is quite configurable, too. In the interest of fairness,
>> performance or efficiency, you can choose to dynamically add or remove queues
>> to the poll thread or spawn more threads and redistribute the work.
>>
>> F.
>
> Note though that SPDK doesn't support sharing the device between host and the
> guests, it takes over the nvme device, thus it makes the kernel nvme driver
> unbind from it.

That is absolutely true. However, I find it not to be a problem in practice.

Hypervisor products, especially those caring about performance, efficiency and
fairness, will dedicate NVMe devices to a particular purpose (e.g. vDisk
storage, cache, metadata) and will not share those devices with other use
cases. That's because these products want to deterministically control the
performance aspects of the device, which you just cannot do if you are sharing
the device with a subsystem you do not control.

For scenarios where the device must be shared and such fine-grained control is
not required, using the kernel driver with io_uring looks like it offers very
good performance and flexibility.
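To make that concrete, a minimal liburing sketch of the sort of thing I mean
is below: open a namespace block device, submit one 4 KiB read and reap the
completion. The device path, ring size and buffer handling are placeholders
only; a real datapath would want O_DIRECT with aligned buffers, registered
files/buffers and probably polled completions. Build with something like:
gcc -o uring_read uring_read.c -luring

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];
    int fd, ret;

    /* Example path only; any readable block device or file will do. */
    fd = open("/dev/nvme0n1", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Small ring, no special flags; enough for a single request. */
    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        close(fd);
        return 1;
    }

    /* Queue one 4 KiB read at offset 0 and hand it to the kernel. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* Block until the completion arrives and check the result. */
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0) {
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}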
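And going back to the polling point quoted above, a toy sketch of that model
(one core sweeping the submission rings of many virtual controllers) could
look like the following. The structures and names here are made up for
illustration; they are not SPDK or mdev-nvme code.

#include <inttypes.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define QDEPTH 64

/* Toy stand-in for one guest-visible submission queue. */
struct vq {
    _Atomic uint32_t head;   /* consumer index, advanced by the poller */
    _Atomic uint32_t tail;   /* producer index, advanced by the guest  */
    uint64_t cmd[QDEPTH];    /* stand-in for NVMe submission entries   */
};

/* Consume everything currently pending on one queue; return work done. */
static int drain_queue(struct vq *q)
{
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    int did = 0;

    while (head != tail) {
        printf("handled cmd %" PRIu64 "\n", q->cmd[head % QDEPTH]);
        head++;
        did++;
    }
    atomic_store_explicit(&q->head, head, memory_order_release);
    return did;
}

int main(void)
{
    struct vq q0 = {0}, q1 = {0};
    struct vq *qs[] = { &q0, &q1 };

    /* In the real thing a guest (another process) fills the ring. */
    q0.cmd[0] = 42;
    atomic_store_explicit(&q0.tail, 1, memory_order_release);

    /* One sweep over every queue; a dedicated core would loop forever,
     * and queues could be added, removed or moved to other pollers. */
    for (size_t i = 0; i < sizeof(qs) / sizeof(qs[0]); i++)
        drain_queue(qs[i]);

    return 0;
}

The nice property is that adding another virtual controller is just another
entry in qs[]; nothing on the consumer side changes, and the same loop can be
split across more threads if one core saturates.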
McKenney" , Christoph Hellwig , Sagi Grimberg , "Harris, James R" , Liang Cunming , Jens Axboe , Alex Williamson , Thanos Makatos , To: Maxim Levitsky Return-path: In-Reply-To: <8698ad583b1cfe86afc3d5440be630fc3e8e0680.camel@redhat.com> Content-Language: en-US Content-ID: <06727B797CBF14419894608A12BEBB3F@namprd02.prod.outlook.com> Sender: linux-kernel-owner@vger.kernel.org List-Id: kvm.vger.kernel.org > On Mar 21, 2019, at 5:04 PM, Maxim Levitsky wrote: >=20 > On Thu, 2019-03-21 at 16:41 +0000, Felipe Franciosi wrote: >>> On Mar 21, 2019, at 4:21 PM, Keith Busch wrote: >>>=20 >>> On Thu, Mar 21, 2019 at 04:12:39PM +0000, Stefan Hajnoczi wrote: >>>> mdev-nvme seems like a duplication of SPDK. The performance is not >>>> better and the features are more limited, so why focus on this approac= h? >>>>=20 >>>> One argument might be that the kernel NVMe subsystem wants to offer th= is >>>> functionality and loading the kernel module is more convenient than >>>> managing SPDK to some users. >>>>=20 >>>> Thoughts? >>>=20 >>> Doesn't SPDK bind a controller to a single process? mdev binds to >>> namespaces (or their partitions), so you could have many mdev's assigne= d >>> to many VMs accessing a single controller. >>=20 >> Yes, it binds to a single process which can drive the datapath of multip= le >> virtual controllers for multiple VMs (similar to what you described for = mdev). >> You can therefore efficiently poll multiple VM submission queues (and mu= ltiple >> device completion queues) from a single physical CPU. >>=20 >> The same could be done in the kernel, but the code gets complicated as y= ou add >> more functionality to it. As this is a direct interface with an untruste= d >> front-end (the guest), it's also arguably safer to do in userspace. >>=20 >> Worth noting: you can eventually have a single physical core polling all= sorts >> of virtual devices (eg. virtual storage or network controllers) very >> efficiently. And this is quite configurable, too. In the interest of fai= rness, >> performance or efficiency, you can choose to dynamically add or remove q= ueues >> to the poll thread or spawn more threads and redistribute the work. >>=20 >> F. >=20 > Note though that SPDK doesn't support sharing the device between host and= the > guests, it takes over the nvme device, thus it makes the kernel nvme driv= er > unbind from it. That is absolutely true. However, I find it not to be a problem in practice= . Hypervisor products, specially those caring about performance, efficiency a= nd fairness, will dedicate NVMe devices for a particular purpose (eg. vDisk= storage, cache, metadata) and will not share these devices for other use c= ases. That's because these products want to deterministically control the p= erformance aspects of the device, which you just cannot do if you are shari= ng the device with a subsystem you do not control. For scenarios where the device must be shared and such fine grained control= is not required, it looks like using the kernel driver with io_uring offer= s very good performance with flexibility. Cheers, Felipe=