From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 6 Mar 2024 15:44:16 +0100
From: Christoph Hellwig <hch@lst.de>
To: Leon Romanovsky
Cc: Robin Murphy, Christoph Hellwig, Marek Szyprowski, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Chaitanya Kulkarni, Jonathan Corbet,
	Jens Axboe, Keith Busch, Sagi Grimberg, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-rdma@vger.kernel.org, iommu@lists.linux.dev,
	linux-nvme@lists.infradead.org, kvm@vger.kernel.org,
	linux-mm@kvack.org, Bart Van Assche, Damien Le Moal,
	Amir Goldstein, josef@toxicpanda.com, "Martin K. Petersen",
	daniel@iogearbox.net, Dan Williams, jack@suse.com, Zhu Yanjun
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Message-ID: <20240306144416.GB19711@lst.de>
References: <47afacda-3023-4eb7-b227-5f725c3187c2@arm.com>
	<20240305122935.GB36868@unreal>
In-Reply-To: <20240305122935.GB36868@unreal>

On Tue, Mar 05, 2024 at 02:29:35PM +0200, Leon Romanovsky wrote:
> > > These advanced DMA mapping APIs are needed to calculate IOVA size to
> > > allocate it as one chunk and some sort of offset calculations to know
> > > which part of IOVA to map.
> >
> > I don't follow this part at all - at *some* point, something must know
> > a range of memory addresses involved in a DMA transfer, so that's where
> > it should map that range for DMA.
>
> In all presented cases in this series, the overall DMA size is known in
> advance.  In the RDMA case it is known when the user registers the
> memory, in VFIO when live migration is happening, and in NVMe when the
> bio is created.
>
> So once we have allocated the IOVA, we will need to link ranges, which
> is the same as map but without the IOVA allocation.

But above you say: "These advanced DMA mapping APIs are needed to
calculate IOVA size to allocate it as one chunk and some sort of offset
calculations to know which part of IOVA to map."  This suggests you need
helpers to calculate the len and the offset.  I can't see where that
would ever make sense.  The total transfer size should just be passed in
by the callers and be known, and there should be no offset.

> > > Instead of teaching DMA to know these specific datatypes, let's
> > > separate the existing DMA mapping routine into two steps and give
> > > advanced callers (subsystems) the option to perform all calculations
> > > internally in advance and map pages later when needed.
> >
> > From a brief look, this is clearly an awkward reinvention of the IOMMU
> > API.  If IOMMU-aware drivers/subsystems want to explicitly manage
> > IOMMU address spaces then they can and should use the IOMMU API.
> > Perhaps there's room for some quality-of-life additions to the IOMMU
> > API to help with common usage patterns, but the generic DMA mapping
> > API is absolutely not the place for it.
>
> DMA mapping gives a nice abstraction from the IOMMU and allows us to
> have the same flow for IOMMU and non-IOMMU cases without duplicating
> code, while you suggest teaching almost every part of the kernel about
> the IOMMU.

Except that the flows are fundamentally different for the "can coalesce"
vs "can't coalesce" case.
In the former we have one dma_addr_t range, and in the latter as many as
there are input vectors (this is ignoring the weird IOMMU merging case
where we coalesce some but not all segments, but I'd rather not have that
in a new API).

So if we want to be able to handle these cases efficiently we need two
APIs in the driver and a good framework to switch between them.  Robin
makes the point here that the IOMMU API handles the can-coalesce case,
and he's right - that's exactly how the IOMMU API works.  I'd still
prefer to wrap it with DMA callers to handle things like swiotlb and
maybe Xen grant tables, and to avoid the type confusion between
dma_addr_t and the untyped iova in the IOMMU layer, but whether to have
this layer at all is probably worth a discussion.

> In this series we changed RDMA, VFIO and NVMe, and in all cases we
> removed more code than we added.  From what I saw, vDPA and virtio-blk
> would benefit from the proposed API too.
>
> Even in this RFC, where Chaitanya did a partial job and didn't convert
> the whole driver, the gain is pretty obvious:
> https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u

I have no idea how that nvme patch is even supposed to work.  It removes
the PRP path in nvme-pci, which is not only the most common I/O path but
actually required for the admin queue, as NVMe doesn't support SGLs for
the admin queue.