[PATCH v4 00/12] mm/swap: clean up and optimize swap cache index

From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Chris Li <chrisl@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Neil Brown <neilb@suse.de>,
	Minchan Kim <minchan@kernel.org>, Hugh Dickins <hughd@google.com>,
	David Hildenbrand <david@redhat.com>,
	Yosry Ahmed <yosryahmed@google.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Kairui Song <kasong@tencent.com>
Subject: [PATCH v4 00/12] mm/swap: clean up and optimize swap cache index
Date: Thu,  2 May 2024 16:45:57 +0800
Message-ID: <20240502084609.28376-1-ryncsn@gmail.com>

From: Kairui Song <kasong@tencent.com>

This is based on the latest mm-unstable. Patch 1/12 is not needed if
f2fs has converted .readahead to use folios; I included it for easier
testing and review.

Currently we use one swap_address_space for every 64M chunk to reduce lock
contention. This is like having a set of smaller files inside a
swap device. But when doing a swap cache lookup or insert, we are
still using the offset within the whole large swap device. This is OK for
correctness, as the offset (key) is unique.

But XArray is specially optimized for small indexes: it creates the
radix tree levels lazily, just enough to fit the largest key
stored in one XArray. So we are wasting tree nodes unnecessarily.

For a 64M chunk it should take at most 3 levels to contain everything.
But if we use the offset within the whole swap device, the offset (key)
value will be way beyond 64M, and so will the tree depth.

Optimize this by reducing the swap cache search space to the 64M scope.

Tested with `time memhog 128G` inside an 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped; tested 3 times, results are stable. The
result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
is enabled, as the swap out path can never skip the swap cache):

Before:
6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
0inputs+0outputs (55major+33555018minor)pagefaults 0swaps

After (+1.8% faster):
6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
0inputs+0outputs (54major+33555027minor)pagefaults 0swaps

Similar result with MySQL and sysbench using swap:
Before:
94055.61 qps

After (+0.8% faster):
94834.91 qps

There is also a very slight drop in radix tree node slab usage:
Before: 303952K
After:  302224K

For this series:

There are multiple places that expect mixed types of pages (page cache or
swap cache), e.g. migration and huge memory split. There are four helpers
for that:

- page_index
- page_file_offset
- folio_index
- folio_file_pos

To keep the code clean and compatible, this series first cleans up
their usage.

page_file_offset and folio_file_pos are historical helpers that can
simply be dropped after the cleanup. And all page_index users can be
converted to folio_index or folio->index.

Then introduce two new helpers for swap: swap_cache_index and
swap_dev_pos. Replace swp_offset with swap_cache_index when it is used to
retrieve a folio from the swap cache, and use swap_dev_pos when the
device position of a swap entry is needed. This way,
swap_cache_index can return the optimized value with no compatibility
issue.

The result is better performance and reduced LOC.

Ideally, in the future, we may want to reduce SWAP_ADDRESS_SPACE_SHIFT
from 14 to 12: the default XArray chunk shift is 6, so we have 3-level
trees instead of 2-level trees just for 2 extra bits. But the swap cache
is based on struct address_space; with 4 times more metadata sparsely
distributed in memory it wastes more cachelines, and the performance gain
from this series is almost canceled according to my test. So for now,
just have a cleaner separation of offsets and a smaller search space.

Patch 1/12 - 11/12: Clean up usage of the above helpers.
Patch 12/12: Apply the optimization.

V3: https://lore.kernel.org/all/20240429190500.30979-1-ryncsn@gmail.com/
Update from V3:
- Help remove a redundant loop in nilfs2 [Matthew Wilcox]
- Update commit message, use the term swap device instead of swap file
  to avoid confusion [Huang, Ying]
- Add more details in commit message about folio_file_pos usage in NFS.
- Fix a shadow leak in clear_shadow_from_swap_cache.

V2: https://lore.kernel.org/linux-mm/20240423170339.54131-1-ryncsn@gmail.com/
Update from V2:
- Clean up usage of page_file_offset and folio_file_pos [Matthew Wilcox]
  https://lore.kernel.org/linux-mm/ZiiFHTwgu8FGio1k@casper.infradead.org/
- Use folio in nilfs_bmap_data_get_key [Ryusuke Konishi]

V1: https://lore.kernel.org/all/20240417160842.76665-1-ryncsn@gmail.com/
Update from V1:
- Convert more users to use folio directly when possible [Matthew Wilcox]
- Rename swap_file_pos to swap_dev_pos [Huang, Ying]
- Update comments and commit message.
- Adjust headers and add dummy function to fix build error.

This series is part of the effort to reduce swap cache overhead, and ultimately
remove SWP_SYNCHRONOUS_IO and unify swap cache usage as proposed before:
https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/

Kairui Song (12):
  f2fs: drop usage of page_index
  nilfs2: drop usage of page_index
  ceph: drop usage of page_index
  NFS: remove nfs_page_lengthg and usage of page_index
  cifs: drop usage of page_file_offset
  afs: drop usage of folio_file_pos
  netfs: drop usage of folio_file_pos
  nfs: drop usage of folio_file_pos
  mm/swap: get the swap device offset directly
  mm: remove page_file_offset and folio_file_pos
  mm: drop page_index and convert folio_index to use folio
  mm/swap: reduce swap cache search space

 fs/afs/dir.c              |  6 +++---
 fs/afs/dir_edit.c         |  4 ++--
 fs/ceph/dir.c             |  2 +-
 fs/ceph/inode.c           |  2 +-
 fs/f2fs/data.c            |  2 +-
 fs/netfs/buffered_read.c  |  4 ++--
 fs/netfs/buffered_write.c |  2 +-
 fs/nfs/file.c             |  2 +-
 fs/nfs/internal.h         | 19 -------------------
 fs/nfs/nfstrace.h         |  4 ++--
 fs/nfs/write.c            |  6 +++---
 fs/nilfs2/bmap.c          | 10 ++--------
 fs/smb/client/file.c      |  2 +-
 include/linux/mm.h        | 13 -------------
 include/linux/pagemap.h   | 25 ++++---------------------
 mm/huge_memory.c          |  2 +-
 mm/memcontrol.c           |  2 +-
 mm/mincore.c              |  2 +-
 mm/page_io.c              |  6 +++---
 mm/shmem.c                |  2 +-
 mm/swap.h                 | 24 ++++++++++++++++++++++++
 mm/swap_state.c           | 19 ++++++++++---------
 mm/swapfile.c             | 11 +++++------
 23 files changed, 70 insertions(+), 101 deletions(-)

-- 
2.44.0

Thread overview: 21+ messages
2024-05-02  8:45 Kairui Song [this message]
2024-05-02  8:45 ` [PATCH v4 01/12] f2fs: drop usage of page_index Kairui Song
2024-05-02  8:45 ` [PATCH v4 02/12] nilfs2: " Kairui Song
2024-05-02 11:08   ` Ryusuke Konishi
2024-05-02  8:46 ` [PATCH v4 03/12] ceph: " Kairui Song
2024-05-02  8:46 ` [PATCH v4 04/12] NFS: remove nfs_page_lengthg and " Kairui Song
2024-05-02  8:46 ` [PATCH v4 05/12] cifs: drop usage of page_file_offset Kairui Song
2024-05-02  8:46 ` [PATCH v4 06/12] afs: drop usage of folio_file_pos Kairui Song
2024-05-02  8:46 ` [PATCH v4 07/12] netfs: " Kairui Song
2024-05-02  8:49 ` [PATCH v4 08/12] nfs: " Kairui Song
2024-05-02  8:49 ` [PATCH v4 09/12] mm/swap: get the swap device offset directly Kairui Song
2024-05-08  6:28   ` Huang, Ying
2024-05-02  8:49 ` [PATCH v4 10/12] mm: remove page_file_offset and folio_file_pos Kairui Song
2024-05-08  6:32   ` Huang, Ying
2024-05-02  8:49 ` [PATCH v4 11/12] mm: drop page_index and convert folio_index to use folio Kairui Song
2024-05-02  9:12   ` David Hildenbrand
2024-05-02  9:32     ` Kairui Song
2024-05-02  9:38       ` Kairui Song
2024-05-02  8:49 ` [PATCH v4 12/12] mm/swap: reduce swap cache search space Kairui Song
2024-05-08  7:25   ` Huang, Ying
2024-05-08  7:49     ` Kairui Song
