All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Felix Kuehling <felix.kuehling@amd.com>
To: Ralph Campbell <rcampbell@nvidia.com>,
	Alex Sierra <alex.sierra@amd.com>,
	 akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-ext4@vger.kernel.org,  linux-xfs@vger.kernel.org, "Yang,
	Philip" <Philip.Yang@amd.com>
Cc: jglisse@redhat.com, jgg@nvidia.com,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
	hch@lst.de
Subject: Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount
Date: Mon, 28 Jun 2021 12:46:59 -0400	[thread overview]
Message-ID: <cecd2164-ebfd-382b-af73-992cdc4304b7@amd.com> (raw)
In-Reply-To: <7163dbb6-67b5-6eef-5772-500fd2107e5c@nvidia.com>


Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
>
> On 6/17/21 8:16 AM, Alex Sierra wrote:
>> From: Ralph Campbell <rcampbell@nvidia.com>
>>
>> ZONE_DEVICE struct pages have an extra reference count that
>> complicates the
>> code for put_page() and several places in the kernel that need to
>> check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't
>> need to
>> be treated specially for ZONE_DEVICE.
>>
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>   fs/dax.c                               |  4 +-
>>   include/linux/dax.h                    |  2 +-
>>   include/linux/memremap.h               |  7 +--
>>   include/linux/mm.h                     | 44 -----------------
>>   lib/test_hmm.c                         |  2 +-
>>   mm/internal.h                          |  8 +++
>>   mm/memremap.c                          | 68 +++++++-------------------
>>   mm/migrate.c                           |  5 --
>>   mm/page_alloc.c                        |  3 ++
>>   mm/swap.c                              | 45 ++---------------
>>   12 files changed, 45 insertions(+), 147 deletions(-)
>>
> I think it is great that you are picking this up and trying to revive it.
>
> However, I have a number of concerns about how it affects existing
> ZONE_DEVICE
> MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
> addressing them. For example, dev_dax_probe() allocates
> MEMORY_DEVICE_GENERIC
> struct pages and then:
>   dev_dax_fault()
>     dev_dax_huge_fault()
>       __dev_dax_pte_fault()
>         vmf_insert_mixed()
> which just inserts the PFN into the CPU page tables without increasing
> the page
> refcount so it is zero (whereas it was one before). But using
> get_page() will
> trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current
> notion of
> free verses allocated for these struct pages. I suppose init_page_count()
> could be called on all the struct pages in dev_dax_probe() to fix that
> though.

Hi Ralph,

For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So
I'm not sure what the reference counting is good for in this case.

Alex is going to add free_zone_device_page support for DEVICE_GENERIC
pages (patch 8 of this series). However, even then, it only does
anything if there is an actual call to put_page. Where would that call
come from in the dev_dax driver case?


>
> I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File
> systems have clear
> allocate and free states for backing storage but there are the
> complications with
> the page cache references, etc. to consider. The >1 to 1 reference
> count seems to
> be used to tell when a page is idle (no I/O, reclaim scanners) rather
> than free
> (not allocated to any file) but I'm not 100% sure about that since I
> don't really
> understand all the issues around why a file system needs to have a DAX
> mount option
> besides knowing that the storage block size has to be a multiple of
> the page size.

The only thing that happens in free_zone_device_page for FS_DAX pages is
wake_up_var(&page->_refcount). I guess, whoever is waiting for this
wake-up will need to be prepared to see a refcount 0 instead of 1 now. I
see these callers that would need to be updated:

./fs/ext4/inode.c:        error = ___wait_var_event(&page->_refcount,
./fs/ext4/inode.c-                atomic_read(&page->_refcount) == 1,
./fs/ext4/inode.c-                TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-                ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:    return ___wait_var_event(&page->_refcount,
./fs/fuse/dax.c-            atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-            0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:    return ___wait_var_event(&page->_refcount,
./fs/xfs/xfs_file.c-            atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-            0, 0, xfs_wait_dax_page(inode));

Regarding your page-cache comment, doesn't DAX mean that those file
pages are not in the page cache?

Regards,
  Felix


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

WARNING: multiple messages have this Message-ID (diff)
From: Felix Kuehling <felix.kuehling@amd.com>
To: Ralph Campbell <rcampbell@nvidia.com>,
	Alex Sierra <alex.sierra@amd.com>,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org, "Yang,
	Philip" <Philip.Yang@amd.com>
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	hch@lst.de, jgg@nvidia.com, jglisse@redhat.com
Subject: Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount
Date: Mon, 28 Jun 2021 12:46:59 -0400	[thread overview]
Message-ID: <cecd2164-ebfd-382b-af73-992cdc4304b7@amd.com> (raw)
In-Reply-To: <7163dbb6-67b5-6eef-5772-500fd2107e5c@nvidia.com>


Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
>
> On 6/17/21 8:16 AM, Alex Sierra wrote:
>> From: Ralph Campbell <rcampbell@nvidia.com>
>>
>> ZONE_DEVICE struct pages have an extra reference count that
>> complicates the
>> code for put_page() and several places in the kernel that need to
>> check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't
>> need to
>> be treated specially for ZONE_DEVICE.
>>
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>   fs/dax.c                               |  4 +-
>>   include/linux/dax.h                    |  2 +-
>>   include/linux/memremap.h               |  7 +--
>>   include/linux/mm.h                     | 44 -----------------
>>   lib/test_hmm.c                         |  2 +-
>>   mm/internal.h                          |  8 +++
>>   mm/memremap.c                          | 68 +++++++-------------------
>>   mm/migrate.c                           |  5 --
>>   mm/page_alloc.c                        |  3 ++
>>   mm/swap.c                              | 45 ++---------------
>>   12 files changed, 45 insertions(+), 147 deletions(-)
>>
> I think it is great that you are picking this up and trying to revive it.
>
> However, I have a number of concerns about how it affects existing
> ZONE_DEVICE
> MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
> addressing them. For example, dev_dax_probe() allocates
> MEMORY_DEVICE_GENERIC
> struct pages and then:
>   dev_dax_fault()
>     dev_dax_huge_fault()
>       __dev_dax_pte_fault()
>         vmf_insert_mixed()
> which just inserts the PFN into the CPU page tables without increasing
> the page
> refcount so it is zero (whereas it was one before). But using
> get_page() will
> trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current
> notion of
> free verses allocated for these struct pages. I suppose init_page_count()
> could be called on all the struct pages in dev_dax_probe() to fix that
> though.

Hi Ralph,

For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So
I'm not sure what the reference counting is good for in this case.

Alex is going to add free_zone_device_page support for DEVICE_GENERIC
pages (patch 8 of this series). However, even then, it only does
anything if there is an actual call to put_page. Where would that call
come from in the dev_dax driver case?


>
> I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File
> systems have clear
> allocate and free states for backing storage but there are the
> complications with
> the page cache references, etc. to consider. The >1 to 1 reference
> count seems to
> be used to tell when a page is idle (no I/O, reclaim scanners) rather
> than free
> (not allocated to any file) but I'm not 100% sure about that since I
> don't really
> understand all the issues around why a file system needs to have a DAX
> mount option
> besides knowing that the storage block size has to be a multiple of
> the page size.

The only thing that happens in free_zone_device_page for FS_DAX pages is
wake_up_var(&page->_refcount). I guess, whoever is waiting for this
wake-up will need to be prepared to see a refcount 0 instead of 1 now. I
see these callers that would need to be updated:

./fs/ext4/inode.c:        error = ___wait_var_event(&page->_refcount,
./fs/ext4/inode.c-                atomic_read(&page->_refcount) == 1,
./fs/ext4/inode.c-                TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-                ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:    return ___wait_var_event(&page->_refcount,
./fs/fuse/dax.c-            atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-            0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:    return ___wait_var_event(&page->_refcount,
./fs/xfs/xfs_file.c-            atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-            0, 0, xfs_wait_dax_page(inode));

Regarding your page-cache comment, doesn't DAX mean that those file
pages are not in the page cache?

Regards,
  Felix



WARNING: multiple messages have this Message-ID (diff)
From: Felix Kuehling <felix.kuehling@amd.com>
To: Ralph Campbell <rcampbell@nvidia.com>,
	Alex Sierra <alex.sierra@amd.com>,
	 akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-ext4@vger.kernel.org,  linux-xfs@vger.kernel.org, "Yang,
	Philip" <Philip.Yang@amd.com>
Cc: jglisse@redhat.com, jgg@nvidia.com,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
	hch@lst.de
Subject: Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount
Date: Mon, 28 Jun 2021 12:46:59 -0400	[thread overview]
Message-ID: <cecd2164-ebfd-382b-af73-992cdc4304b7@amd.com> (raw)
In-Reply-To: <7163dbb6-67b5-6eef-5772-500fd2107e5c@nvidia.com>


Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
>
> On 6/17/21 8:16 AM, Alex Sierra wrote:
>> From: Ralph Campbell <rcampbell@nvidia.com>
>>
>> ZONE_DEVICE struct pages have an extra reference count that
>> complicates the
>> code for put_page() and several places in the kernel that need to
>> check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't
>> need to
>> be treated specially for ZONE_DEVICE.
>>
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>   fs/dax.c                               |  4 +-
>>   include/linux/dax.h                    |  2 +-
>>   include/linux/memremap.h               |  7 +--
>>   include/linux/mm.h                     | 44 -----------------
>>   lib/test_hmm.c                         |  2 +-
>>   mm/internal.h                          |  8 +++
>>   mm/memremap.c                          | 68 +++++++-------------------
>>   mm/migrate.c                           |  5 --
>>   mm/page_alloc.c                        |  3 ++
>>   mm/swap.c                              | 45 ++---------------
>>   12 files changed, 45 insertions(+), 147 deletions(-)
>>
> I think it is great that you are picking this up and trying to revive it.
>
> However, I have a number of concerns about how it affects existing
> ZONE_DEVICE
> MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
> addressing them. For example, dev_dax_probe() allocates
> MEMORY_DEVICE_GENERIC
> struct pages and then:
>   dev_dax_fault()
>     dev_dax_huge_fault()
>       __dev_dax_pte_fault()
>         vmf_insert_mixed()
> which just inserts the PFN into the CPU page tables without increasing
> the page
> refcount so it is zero (whereas it was one before). But using
> get_page() will
> trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current
> notion of
> free verses allocated for these struct pages. I suppose init_page_count()
> could be called on all the struct pages in dev_dax_probe() to fix that
> though.

Hi Ralph,

For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So
I'm not sure what the reference counting is good for in this case.

Alex is going to add free_zone_device_page support for DEVICE_GENERIC
pages (patch 8 of this series). However, even then, it only does
anything if there is an actual call to put_page. Where would that call
come from in the dev_dax driver case?


>
> I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File
> systems have clear
> allocate and free states for backing storage but there are the
> complications with
> the page cache references, etc. to consider. The >1 to 1 reference
> count seems to
> be used to tell when a page is idle (no I/O, reclaim scanners) rather
> than free
> (not allocated to any file) but I'm not 100% sure about that since I
> don't really
> understand all the issues around why a file system needs to have a DAX
> mount option
> besides knowing that the storage block size has to be a multiple of
> the page size.

The only thing that happens in free_zone_device_page for FS_DAX pages is
wake_up_var(&page->_refcount). I guess, whoever is waiting for this
wake-up will need to be prepared to see a refcount 0 instead of 1 now. I
see these callers that would need to be updated:

./fs/ext4/inode.c:        error = ___wait_var_event(&page->_refcount,
./fs/ext4/inode.c-                atomic_read(&page->_refcount) == 1,
./fs/ext4/inode.c-                TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-                ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:    return ___wait_var_event(&page->_refcount,
./fs/fuse/dax.c-            atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-            0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:    return ___wait_var_event(&page->_refcount,
./fs/xfs/xfs_file.c-            atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-            0, 0, xfs_wait_dax_page(inode));

Regarding your page-cache comment, doesn't DAX mean that those file
pages are not in the page cache?

Regards,
  Felix



  reply	other threads:[~2021-06-28 16:47 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-17 15:16 [PATCH v3 0/8] Support DEVICE_GENERIC memory in migrate_vma_* Alex Sierra
2021-06-17 15:16 ` Alex Sierra
2021-06-17 15:16 ` Alex Sierra
2021-06-17 15:16 ` [PATCH v3 1/8] ext4/xfs: add page refcount helper Alex Sierra
2021-06-17 15:16   ` Alex Sierra
2021-06-17 15:16   ` Alex Sierra
2021-06-17 16:11   ` Darrick J. Wong
2021-06-17 16:11     ` Darrick J. Wong
2021-06-17 16:11     ` Darrick J. Wong
2021-06-17 23:52   ` Dave Chinner
2021-06-17 23:52     ` Dave Chinner
2021-06-17 23:52     ` Dave Chinner
2021-06-19 16:14     ` Sierra Guiza, Alejandro (Alex)
2021-06-19 16:14       ` Sierra Guiza, Alejandro (Alex)
2021-06-19 16:14       ` Sierra Guiza, Alejandro (Alex)
2021-06-19 18:31   ` Alex Sierra
2021-06-19 18:31     ` Alex Sierra
2021-06-19 18:31     ` Alex Sierra
2021-06-17 15:16 ` [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
2021-06-17 15:16   ` Alex Sierra
2021-06-17 15:16   ` Alex Sierra
2021-06-17 19:16   ` Ralph Campbell
2021-06-17 19:16     ` Ralph Campbell
2021-06-17 19:16     ` Ralph Campbell
2021-06-28 16:46     ` Felix Kuehling [this message]
2021-06-28 16:46       ` Felix Kuehling
2021-06-28 16:46       ` Felix Kuehling
2021-06-30  0:23       ` Ralph Campbell
2021-06-30  0:23         ` Ralph Campbell
2021-06-30  0:23         ` Ralph Campbell
2021-06-17 15:17 ` [PATCH v3 3/8] kernel: resource: lookup_resource as exported symbol Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17 ` [PATCH v3 4/8] drm/amdkfd: add SPM support for SVM Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17 ` [PATCH v3 5/8] drm/amdkfd: generic type as sys mem on migration to ram Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17 ` [PATCH v3 6/8] include/linux/mm.h: helpers to check zone device generic type Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17 ` [PATCH v3 7/8] mm: add generic type support to migrate_vma helpers Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17 ` [PATCH v3 8/8] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:17   ` Alex Sierra
2021-06-17 15:56 ` [PATCH v3 0/8] Support DEVICE_GENERIC memory in migrate_vma_* Sierra Guiza, Alejandro (Alex)
2021-06-17 15:56   ` Sierra Guiza, Alejandro (Alex)
2021-06-17 15:56   ` Sierra Guiza, Alejandro (Alex)
2021-06-20 14:14 ` Theodore Ts'o
2021-06-20 14:14   ` Theodore Ts'o
2021-06-20 14:14   ` Theodore Ts'o
2021-06-23 21:49   ` Felix Kuehling
2021-06-23 21:49     ` Felix Kuehling
2021-06-23 21:49     ` Felix Kuehling
2021-06-24  5:30     ` Christoph Hellwig
2021-06-24  5:30       ` Christoph Hellwig
2021-06-24 15:08       ` Felix Kuehling
2021-06-24 15:08         ` Felix Kuehling
2021-06-24 15:08         ` Felix Kuehling
2021-07-16 15:07     ` Theodore Y. Ts'o
2021-07-16 15:07       ` Theodore Y. Ts'o
2021-07-16 15:07       ` Theodore Y. Ts'o
2021-07-16 22:14       ` Felix Kuehling
2021-07-16 22:14         ` Felix Kuehling
2021-07-16 22:14         ` Felix Kuehling
2021-07-17 19:54         ` Sierra Guiza, Alejandro (Alex)
2021-07-17 19:54           ` Sierra Guiza, Alejandro (Alex)
2021-07-17 19:54           ` Sierra Guiza, Alejandro (Alex)
2021-07-23 22:46           ` Sierra Guiza, Alejandro (Alex)
2021-07-23 22:46             ` Sierra Guiza, Alejandro (Alex)
2021-07-23 22:46             ` Sierra Guiza, Alejandro (Alex)
2021-07-30 19:02             ` Felix Kuehling

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cecd2164-ebfd-382b-af73-992cdc4304b7@amd.com \
    --to=felix.kuehling@amd.com \
    --cc=Philip.Yang@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@lst.de \
    --cc=jgg@nvidia.com \
    --cc=jglisse@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=rcampbell@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.