From: Jason Gunthorpe <jgg@nvidia.com>
To: "Zeng, Oak" <oak.zeng@intel.com>
Cc: "dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Brost, Matthew" <matthew.brost@intel.com>,
"Thomas.Hellstrom@linux.intel.com"
<Thomas.Hellstrom@linux.intel.com>,
"Welty, Brian" <brian.welty@intel.com>,
"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
"Bommu, Krishnaiah" <krishnaiah.bommu@intel.com>,
"Vishwanathapura,
Niranjana" <niranjana.vishwanathapura@intel.com>,
Leon Romanovsky <leon@kernel.org>
Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
Date: Wed, 24 Apr 2024 10:48:40 -0300 [thread overview]
Message-ID: <20240424134840.GJ941030@nvidia.com> (raw)
In-Reply-To: <SA1PR11MB6991EDB4351D99B4E76EBC2992112@SA1PR11MB6991.namprd11.prod.outlook.com>
On Tue, Apr 23, 2024 at 09:17:03PM +0000, Zeng, Oak wrote:
> > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> >
> > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > a sgl mapping if an invalidation comes in.
> > >
> > > You are right, if we register a huge mmu interval notifier to cover
> > > the whole address space, then we should use dma map/unmap pages to
> > > map bits of the address space. We will explore this approach.
> > >
> > > Right now, in xe driver, mmu interval notifier is dynamically
> > > registered with small address range. We map/unmap the whole small
> > > address ranges each time. So functionally it still works. But it
> > > might not be as performant as the method you said.
> >
> > Please don't do this, it is not how hmm_range_fault() should be
> > used.
> >
> > It is intended to be page by page and there is no API surface
> > available to safely try to construct covering ranges. Drivers
> > definately should not try to invent such a thing.
>
> I need your help to understand this comment. Our gpu mirrors the
> whole CPU virtual address space. It is the first design pattern in
> your previous reply (entire exclusive owner of a single device
> private page table and fully mirrors the mm page table into the
> device table.)
>
> What do you mean by "page by page"/" no API surface available to
> safely try to construct covering ranges"? As I understand it,
> hmm_range_fault take a virtual address range (defined in hmm_range
> struct), and walk cpu page table in this range. It is a range based
> API.
Yes, but invalidation is not linked to range_fault so you can get
invalidations of single pages. You are binding range_fault to
dma_map_sg but then you can't handle invalidation at all sanely.
> From your previous reply ("So I find it a quite strange that this
> RFC is creating VMA's, notifiers and ranges on the fly "), it seems
> you are concerning why we are creating vma and register mmu interval
> notifier on the fly. Let me try to explain it. Xe_vma is a very
> fundamental concept in xe driver.
I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
and you do need to ensure the page table manipulation has an API that
is usable. "populate an entire VMA" / "invalidate an entire VMA" is
not a sufficient primitive.
> The gpu page table update, invalidation are all vma-based. This
> concept exists before this svm work. For svm, we create a 2M (the
> size is user configurable) vma during gpu page fault handler and
> register this 2M range to mmu interval notifier.
You can create VMAs, but they can't be fully populated and they should
be alot bigger than 2M. ODP uses a similar 2 level scheme for it's
SVA, the "vma" granual is 512M.
The key thing is the VMA is just a container for the notifier and
other data structures, it doesn't insist the range be fully populated
and it must support page-by-page unmap/remap/invalidateion.
> Now I try to figure out if we don't create vma, what can we do? We
> can map one page (which contains the gpu fault address) to gpu page
> table. But that doesn't work for us because the GPU cache and TLB
> would not be performant for 4K page each time. One way to think of
> the vma is a chunk size which is good for GPU HW performance.
From this perspective invalidation is driven by the range the
invalidation callback gets, it could be a single page, but probably
bigger.
mapping is driven by the range passed to hmm_range_fault() during
fault handling, which is entirely based on your drivers prefetch
logic.
GPU TLB invalidation sequences should happen once, at the end, for any
invalidation or range_fault sequence regardless. Nothing about "gpu
vmas" should have anything to do with this.
> And the mmu notifier... if we don't register the mmu notifier on the
> fly, do we register one mmu notifier to cover the whole CPU virtual
> address space (which would be huge, e.g., 0~2^56 on a 57 bit
> machine, if we have half half user space kernel space split)? That
> won't be performant as well because for any address range that is
> unmapped from cpu program, but even if they are never touched by
> GPU, gpu driver still got a invalidation callback. In our approach,
> we only register a mmu notifier for address range that we know gpu
> would touch it.
When SVA is used something, somewhere, has to decide if a CPU VA
intersects with a HW VA.
The mmu notifiers are orginized in an interval (red/black) tree, so if
you have a huge number of them the RB search becomes very expensive.
Conversly your GPU page table is organized in a radix tree, so
detecting no-presence during invalidation is a simple radix
walk. Indeed for the obviously unused ranges it is probably a pointer
load and single de-ref to check it.
I fully expect the radix walk is much, much faster than a huge number
of 2M notifiers in the red/black tree.
Notifiers for SVA cases should be giant. If not the entire memory
space, then at least something like 512M/1G kind of size, neatly
aligned with something in your page table structure so the radix walk
can start lower in the tree automatically.
> > > For example, when gpu page fault happens, you look
> > > up the cpu vm_area_struct for the fault address and create a
> > > corresponding state/struct. And people can have as many cpu
> > > vm_area_struct as they want.
> >
> > No you don't.
>
> See explains above. I need your help to understand how we can do it
> without a vma (or chunk). One thing GPU driver is different from
> RDMA driver is, RDMA doesn't have device private memory so no
> migration.
I want to be very clear, there should be no interaction of your
hmm_range_fault based code with MM centric VMAs. You MUST NOT look up
the CPU vma_area_struct in your driver.
> It only need to dma-map the system memory pages and use
> them to fill RDMA page table. so RDMA don't need another memory
> manager such as our buddy. RDMA only deal with system memory which
> is completely struct page based management. Page by page make 100 %
> sense for RDMA.
I don't think this is the issue at hand, you just have some historical
poorly designed page table manipulation code from what I can
understand..
Jason
next prev parent reply other threads:[~2024-04-24 13:48 UTC|newest]
Thread overview: 123+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
2024-01-17 22:12 ` [PATCH 01/23] drm/xe/svm: Add SVM document Oak Zeng
2024-01-17 22:12 ` [PATCH 02/23] drm/xe/svm: Add svm key data structures Oak Zeng
2024-01-17 22:12 ` [PATCH 03/23] drm/xe/svm: create xe svm during vm creation Oak Zeng
2024-01-17 22:12 ` [PATCH 04/23] drm/xe/svm: Trace svm creation Oak Zeng
2024-01-17 22:12 ` [PATCH 05/23] drm/xe/svm: add helper to retrieve svm range from address Oak Zeng
2024-01-17 22:12 ` [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range Oak Zeng
2024-04-05 0:39 ` Jason Gunthorpe
2024-04-05 3:33 ` Zeng, Oak
2024-04-05 12:37 ` Jason Gunthorpe
2024-04-05 16:42 ` Zeng, Oak
2024-04-05 18:02 ` Jason Gunthorpe
2024-04-09 16:45 ` Zeng, Oak
2024-04-09 17:24 ` Jason Gunthorpe
2024-04-23 21:17 ` Zeng, Oak
2024-04-24 2:31 ` Matthew Brost
2024-04-24 13:57 ` Jason Gunthorpe
2024-04-24 16:35 ` Matthew Brost
2024-04-24 16:44 ` Jason Gunthorpe
2024-04-24 16:56 ` Matthew Brost
2024-04-24 17:48 ` Jason Gunthorpe
2024-04-24 13:48 ` Jason Gunthorpe [this message]
2024-04-24 23:59 ` Zeng, Oak
2024-04-25 1:05 ` Jason Gunthorpe
2024-04-26 9:55 ` Thomas Hellström
2024-04-26 12:00 ` Jason Gunthorpe
2024-04-26 14:49 ` Thomas Hellström
2024-04-26 16:35 ` Jason Gunthorpe
2024-04-29 8:25 ` Thomas Hellström
2024-04-30 17:30 ` Jason Gunthorpe
2024-04-30 18:57 ` Daniel Vetter
2024-05-01 0:09 ` Jason Gunthorpe
2024-05-02 8:04 ` Daniel Vetter
2024-05-02 9:11 ` Thomas Hellström
2024-05-02 12:46 ` Jason Gunthorpe
2024-05-02 15:01 ` Thomas Hellström
2024-05-02 19:25 ` Zeng, Oak
2024-05-03 13:37 ` Jason Gunthorpe
2024-05-03 14:43 ` Zeng, Oak
2024-05-03 16:28 ` Jason Gunthorpe
2024-05-03 20:29 ` Zeng, Oak
2024-05-04 1:03 ` Dave Airlie
2024-05-06 13:04 ` Daniel Vetter
2024-05-06 23:50 ` Matthew Brost
2024-05-07 11:56 ` Jason Gunthorpe
2024-05-06 13:33 ` Jason Gunthorpe
2024-04-09 17:33 ` Matthew Brost
2024-01-17 22:12 ` [PATCH 07/23] drm/xe/svm: Add helper for binding hmm range to gpu Oak Zeng
2024-01-17 22:12 ` [PATCH 08/23] drm/xe/svm: Add helper to invalidate svm range from GPU Oak Zeng
2024-01-17 22:12 ` [PATCH 09/23] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
2024-01-17 22:12 ` [PATCH 10/23] drm/xe/svm: Introduce svm migration function Oak Zeng
2024-01-17 22:12 ` [PATCH 11/23] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
2024-01-17 22:12 ` [PATCH 12/23] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
2024-01-17 22:12 ` [PATCH 13/23] drm/xe/svm: Handle CPU page fault Oak Zeng
2024-01-17 22:12 ` [PATCH 14/23] drm/xe/svm: trace svm range migration Oak Zeng
2024-01-17 22:12 ` [PATCH 15/23] drm/xe/svm: Implement functions to register and unregister mmu notifier Oak Zeng
2024-01-17 22:12 ` [PATCH 16/23] drm/xe/svm: Implement the mmu notifier range invalidate callback Oak Zeng
2024-01-17 22:12 ` [PATCH 17/23] drm/xe/svm: clean up svm range during process exit Oak Zeng
2024-01-17 22:12 ` [PATCH 18/23] drm/xe/svm: Move a few structures to xe_gt.h Oak Zeng
2024-01-17 22:12 ` [PATCH 19/23] drm/xe/svm: migrate svm range to vram Oak Zeng
2024-01-17 22:12 ` [PATCH 20/23] drm/xe/svm: Populate svm range Oak Zeng
2024-01-17 22:12 ` [PATCH 21/23] drm/xe/svm: GPU page fault support Oak Zeng
2024-01-23 2:06 ` Welty, Brian
2024-01-23 3:09 ` Zeng, Oak
2024-01-23 3:21 ` Making drm_gpuvm work across gpu devices Zeng, Oak
2024-01-23 11:13 ` Christian König
2024-01-23 19:37 ` Zeng, Oak
2024-01-23 20:17 ` Felix Kuehling
2024-01-25 1:39 ` Zeng, Oak
2024-01-23 23:56 ` Danilo Krummrich
2024-01-24 3:57 ` Zeng, Oak
2024-01-24 4:14 ` Zeng, Oak
2024-01-24 6:48 ` Christian König
2024-01-25 22:13 ` Danilo Krummrich
2024-01-24 8:33 ` Christian König
2024-01-25 1:17 ` Zeng, Oak
2024-01-25 1:25 ` David Airlie
2024-01-25 5:25 ` Zeng, Oak
2024-01-26 10:09 ` Christian König
2024-01-26 20:13 ` Zeng, Oak
2024-01-29 10:10 ` Christian König
2024-01-29 20:09 ` Zeng, Oak
2024-01-25 11:00 ` 回复:Making " 周春明(日月)
2024-01-25 17:00 ` Zeng, Oak
2024-01-25 17:15 ` Making " Felix Kuehling
2024-01-25 18:37 ` Zeng, Oak
2024-01-26 13:23 ` Christian König
2024-01-25 16:42 ` Zeng, Oak
2024-01-25 18:32 ` Daniel Vetter
2024-01-25 21:02 ` Zeng, Oak
2024-01-26 8:21 ` Thomas Hellström
2024-01-26 12:52 ` Christian König
2024-01-27 2:21 ` Zeng, Oak
2024-01-29 10:19 ` Christian König
2024-01-30 0:21 ` Zeng, Oak
2024-01-30 8:39 ` Christian König
2024-01-30 22:29 ` Zeng, Oak
2024-01-30 23:12 ` David Airlie
2024-01-31 9:15 ` Daniel Vetter
2024-01-31 20:17 ` Zeng, Oak
2024-01-31 20:59 ` Zeng, Oak
2024-02-01 8:52 ` Christian König
2024-02-29 18:22 ` Zeng, Oak
2024-03-08 4:43 ` Zeng, Oak
2024-03-08 10:07 ` Christian König
2024-01-30 8:43 ` Thomas Hellström
2024-01-29 15:03 ` Felix Kuehling
2024-01-29 15:33 ` Christian König
2024-01-29 16:24 ` Felix Kuehling
2024-01-29 16:28 ` Christian König
2024-01-29 17:52 ` Felix Kuehling
2024-01-29 19:03 ` Christian König
2024-01-29 20:24 ` Felix Kuehling
2024-02-23 20:12 ` Zeng, Oak
2024-02-27 6:54 ` Christian König
2024-02-27 15:58 ` Zeng, Oak
2024-02-28 19:51 ` Zeng, Oak
2024-02-29 9:41 ` Christian König
2024-02-29 16:05 ` Zeng, Oak
2024-02-29 17:12 ` Thomas Hellström
2024-03-01 7:01 ` Christian König
2024-01-17 22:12 ` [PATCH 22/23] drm/xe/svm: Add DRM_XE_SVM kernel config entry Oak Zeng
2024-01-17 22:12 ` [PATCH 23/23] drm/xe/svm: Add svm memory hints interface Oak Zeng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240424134840.GJ941030@nvidia.com \
--to=jgg@nvidia.com \
--cc=Thomas.Hellstrom@linux.intel.com \
--cc=brian.welty@intel.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=himal.prasad.ghimiray@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=krishnaiah.bommu@intel.com \
--cc=leon@kernel.org \
--cc=matthew.brost@intel.com \
--cc=niranjana.vishwanathapura@intel.com \
--cc=oak.zeng@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).