Date: Tue, 27 Apr 2021 15:11:25 +1000
From: David Gibson
To: "Tian, Kevin"
Cc: Jason Gunthorpe, Alex Williamson, "Liu, Yi L", Jacob Pan, Auger Eric,
 Jean-Philippe Brucker, LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
 iommu@lists.linux-foundation.org, cgroups@vger.kernel.org, Tejun Heo,
 Li Zefan, Johannes Weiner, Jonathan Corbet, "Raj, Ashok", "Wu, Hao",
 "Jiang, Dave", Alexey Kardashevskiy
Subject: Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
References: <20210421175203.GN1370958@nvidia.com>
 <20210421133312.15307c44@redhat.com>
 <20210421230301.GP1370958@nvidia.com>
 <20210422111337.6ac3624d@redhat.com>
 <20210422175715.GA1370958@nvidia.com>
 <20210422133747.23322269@redhat.com>
 <20210422200024.GC1370958@nvidia.com>
 <20210422163808.2d173225@redhat.com>
 <20210422233950.GD1370958@nvidia.com>
List-ID: <linux-kernel.vger.kernel.org>

On Fri, Apr 23, 2021 at 10:31:46AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Friday, April 23, 2021 7:40 AM
> >
> > On Thu, Apr 22, 2021 at 04:38:08PM -0600, Alex Williamson wrote:
> >
> > > Because it's fundamental to the isolation of the device?  What you're
> > > proposing doesn't get around the group issue, it just makes it implicit
> > > rather than explicit in the uapi.
> >
> > I'm not even sure it makes it explicit or implicit, it just takes away
> > the FD.
> >
> > There are four group IOCTLs; I see them mapping to /dev/ioasid as follows:
> >  VFIO_GROUP_GET_STATUS -
> >    + VFIO_GROUP_FLAGS_CONTAINER_SET is fairly redundant
> >    + VFIO_GROUP_FLAGS_VIABLE could be in a new sysfs under
> >      kernel/iommu_groups, or could be an IOCTL on /dev/ioasid:
> >        IOASID_ALL_DEVICES_VIABLE
> >
> >  VFIO_GROUP_SET_CONTAINER -
> >    + This happens implicitly when the device joins the IOASID,
> >      so it gets moved to the vfio_device FD:
> >        ioctl(vfio_device_fd, JOIN_IOASID_FD, ioasid_fd)
> >
> >  VFIO_GROUP_UNSET_CONTAINER -
> >    + Also moved to the vfio_device FD, opposite of JOIN_IOASID_FD
> >
> >  VFIO_GROUP_GET_DEVICE_FD -
> >    + Replaced by opening /dev/vfio/deviceX
> >      Learn deviceX, which will be the cdev that sysfs shows as:
> >        /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vfio/deviceX/dev
> >      then open /dev/vfio/deviceX
> >
> > > > How do we model the VFIO group security concept to something like
> > > > VDPA?
> > >
> > > Is it really a "VFIO group security concept"?  We're reflecting the
> > > reality of the hardware, not all devices are fully isolated.
> >
> > Well, exactly.
> >
> > /dev/ioasid should understand the group concept somehow, otherwise it
> > is incomplete and maybe even broken from a security point of view.
> >
> > So, how do I add groups to, say, VDPA in a way that makes sense?  The
> > only answer I come to is broadly what I outlined here - make
> > /dev/ioasid do all the group operations, and do them when we enjoin
> > the VDPA device to the ioasid.
> >
> > Once I have solved all the group problems for the non-VFIO users,
> > then where does that leave VFIO?  Why does VFIO need a group FD if
> > everyone else doesn't?
> >
> > > IOMMU group.  This is the reality that any userspace driver needs to
> > > play in, it doesn't magically go away because we drop the group file
> > > descriptor.
> >
> > I'm not saying it does, I'm saying it makes the uAPI more regular and
> > easier to fit into /dev/ioasid without the group FD.
> >
> > > It only makes the uapi more difficult to use correctly because
> > > userspace drivers need to go outside of the uapi to have any idea
> > > that this restriction exists.
> >
> > I don't think it makes any substantive difference one way or the
> > other.
> >
> > With the group FD: the userspace has to read sysfs, find the list of
> > devices in the group, open the group fd, and create device FDs for
> > each device using the name from sysfs.
> >
> > Starting from a BDF the general pseudo code is:
> >  group_path = readlink("/sys/bus/pci/devices/BDF/iommu_group")
> >  group_name = basename(group_path)
> >  group_fd = open("/dev/vfio/"+group_name)
> >  device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, BDF);
> >
> > Without the group FD: the userspace has to read sysfs, find the list
> > of devices in the group, and then open the device-specific cdev (found
> > via sysfs) and link them to a /dev/ioasid FD.
> >
> > Starting from a BDF the general pseudo code is:
> >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> >  device_fd = open("/dev/vfio/"+device_name)
> >  ioasid_fd = open("/dev/ioasid")
> >  ioctl(device_fd, JOIN_IOASID_FD, ioasid_fd)
> >
> > These two routes can have identical outcomes and identical security
> > checks.
> >
> > In both cases, if userspace wants a list of BDFs in the same group as
> > the BDF it is interested in:
> >  readdir("/sys/bus/pci/devices/BDF/iommu_group/devices")
> >
> > It seems like a very small difference to me.
> >
> > I still don't see how the group restriction gets surfaced to the
> > application through the group FD.  The applications I looked through
> > just treat the group FD as a step on their way to get the device_fd.
> >
>
> So your proposal sort of moves the entire container/group/domain
> management into /dev/ioasid and then leaves VFIO providing only
> device-specific uAPI.
> An ioasid represents a page table (address space), and thus is
> equivalent in scope to a VFIO container.

Right.  I don't really know how /dev/ioasid is supposed to work, and
so far I don't see how it conceptually differs from a container.  What
is it adding?

> Having a device join an ioasid is equivalent to attaching a device to
> a VFIO container, and here the group integrity must be enforced.  Then
> /dev/ioasid needs to manage group objects and their association with
> the ioasid and the underlying iommu domain anyway, so it's pointless
> to keep the same logic within VFIO.  Is this understanding correct?
>
> btw one remaining open is whether you expect /dev/ioasid to be
> associated with a single iommu domain, or multiple.  If only a single
> domain is allowed, the ioasid_fd is equivalent in scope to a VFIO
> container.  It is supposed to have only one gpa_ioasid_id, since one
> iommu domain can only have a single 2nd-level pgtable.  Then all other
> ioasids, once allocated, must be nested on this gpa_ioasid_id to fit
> in the same domain.  If a legacy vIOMMU is exposed (which disallows
> nesting), the userspace has to open an ioasid_fd for every group.
> This is basically the VFIO way.  On the other hand, if multiple
> domains are allowed, there could be multiple ioasid_ids, each holding
> a 2nd-level pgtable and an iommu domain (or a list of pgtables and
> domains, due to the incompatibility issue discussed in another
> thread), and each can be nested by other ioasids respectively.  The
> application only needs to open /dev/ioasid once, regardless of
> whether the vIOMMU allows nesting, and has a single interface for
> ioasid allocation.  Which way do you prefer?
>
> Thanks
> Kevin
>

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson