From: Nhat Pham <nphamcs@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, bfoster@redhat.com,
	willy@infradead.org, linux-api@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH v13 2/3] cachestat: implement cachestat syscall
Date: Wed, 3 May 2023 19:25:57 -0700
Message-ID: <CAKEwX=MmC-5wY2u25YY9WupGLfZrY2V=VGYAZHJqSSdzT9yO3w@mail.gmail.com>
In-Reply-To: <20230503150425.GC193380@cmpxchg.org>

On Wed, May 3, 2023 at 8:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, May 02, 2023 at 06:36:07PM -0700, Nhat Pham wrote:
> > There is currently no good way to query the page cache state of large
> > file sets and directory trees. There is mincore(), but it scales poorly:
> > the kernel writes out a lot of bitmap data that userspace has to
> > aggregate, when the user really does not care about per-page
> > information in that case. The user also needs to mmap and unmap each
> > file as it goes along, which can be quite slow as well.
> >
> > Some use cases where this information could come in handy:
> >   * Allowing a database to decide whether to perform an index scan or
> >     direct table queries based on the in-memory cache state of the
> >     index.
> >   * Visibility into the writeback algorithm, for diagnosing
> >     performance issues.
> >   * Workload-aware writeback pacing: estimating IO fulfilled by page
> >     cache (and IO to be done) within a range of a file, allowing for
> >     more frequent syncing when and where there is IO capacity, and
> >     batching when there is not.
> >   * Computing memory usage of large files/directory trees, analogous to
> >     the du tool for disk usage.
> >
> > More information about these use cases can be found in the following
> > thread:
> >
> > https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
> >
> > This patch implements a new syscall that queries cache state of a file
> > and summarizes the number of cached pages, number of dirty pages, number
> > of pages marked for writeback, number of (recently) evicted pages, etc.
> > in a given range. Currently, the syscall is wired in only for the x86
> > architecture.
> >
> > NAME
> >     cachestat - query the page cache statistics of a file.
> >
> > SYNOPSIS
> >     #include <sys/mman.h>
> >
> >     struct cachestat_range {
> >         __u64 off;
> >         __u64 len;
> >     };
> >
> >     struct cachestat {
> >         __u64 nr_cache;
> >         __u64 nr_dirty;
> >         __u64 nr_writeback;
> >         __u64 nr_evicted;
> >         __u64 nr_recently_evicted;
> >     };
> >
> >     int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
> >         struct cachestat *cstat, unsigned int flags);
> >
> > DESCRIPTION
> >     cachestat() queries the number of cached pages, number of dirty
> >     pages, number of pages marked for writeback, number of evicted
> >     pages, number of recently evicted pages, in the bytes range given by
> >     `off` and `len`.
> >
> >     An evicted page is a page that was previously in the page cache but
> >     has been evicted since. A page is recently evicted if its last
> >     eviction was recent enough that its reentry to the cache would
> >     indicate that it is actively being used by the system, and that
> >     there is memory pressure on the system.
> >
> >     These values are returned in a cachestat struct, whose address is
> >     given by the `cstat` argument.
> >
> >     The `off` and `len` arguments must be non-negative integers. If
> >     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> >     0, we will query in the range from `off` to the end of the file.
> >
> >     The `flags` argument is unused for now, but is included for future
> >     extensibility. Users should pass 0 (i.e., no flags specified).
> >
> >     Currently, hugetlbfs is not supported.
> >
> >     Because the status of a page can change after cachestat() checks it
> >     but before it returns to the application, the returned values may
> >     contain stale information.
> >
> > RETURN VALUE
> >     On success, cachestat returns 0. On error, -1 is returned, and errno
> >     is set to indicate the error.
> >
> > ERRORS
> >     EFAULT cstat or cstat_range points to an invalid address.
> >
> >     EINVAL invalid flags.
> >
> >     EBADF  invalid file descriptor.
> >
> >     EOPNOTSUPP file descriptor is of a hugetlbfs file.
> >
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
>
> Thanks for persisting through the pain. This looks great to me now.
>
> Like I've said before, I think this is sorely needed. The cache is
> frequently the biggest memory consumer in the system. We have a rich
> API for influencing it, but there is a glaring gap when it comes to
> introspection. It's difficult to design control loops without
> feedback. This proposes an intuitive, versatile and scalable interface
> to bridge that gap, and it integrates nicely with the existing VFS API
> for managing the cache. I would love to see this go in.
>
> I'd also love for the `mu' tool you wrote to make it into coreutils
> eventually. It would make debugging memory consumption and writeback
> issues on live systems, especially with complex and/or multiple
> workloads, so much easier.
I'd love to share this too! Let me clean it up and submit it separately.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
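
For readers who want to try the interface described in the SYNOPSIS
above, here is a minimal sketch of a userspace wrapper. It is an
illustration under stated assumptions, not code from the patch series:
the `ex_` names are hypothetical, the structs are local mirrors of the
ones quoted above, and the fallback syscall number 451 is the number
cachestat was later assigned on x86-64 (prefer `__NR_cachestat` from
your own kernel headers when it is available).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 451 is the x86-64 syscall number; only used when the
 * installed headers do not define __NR_cachestat themselves. */
#ifndef __NR_cachestat
#define __NR_cachestat 451
#endif

/* Hypothetical local mirrors of the structs in the SYNOPSIS. The "ex_"
 * prefix avoids clashing with definitions in newer uapi headers. */
struct ex_cachestat_range {
	uint64_t off;	/* start of the queried byte range */
	uint64_t len;	/* length in bytes; 0 means "up to end of file" */
};

struct ex_cachestat {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Hypothetical wrapper around the raw syscall. Returns 0 on success,
 * or -1 with errno set (e.g. EBADF, EFAULT, EINVAL, EOPNOTSUPP as
 * listed under ERRORS, or ENOSYS on kernels without the syscall).
 */
static int ex_cachestat_query(int fd, uint64_t off, uint64_t len,
			      struct ex_cachestat *out)
{
	struct ex_cachestat_range range = { .off = off, .len = len };

	assert(out != NULL);
	/* flags is currently unused, so 0 is passed as described above. */
	return (int)syscall(__NR_cachestat, fd, &range, out, 0);
}
```

A caller would open a file, call `ex_cachestat_query(fd, 0, 0, &cs)` to
cover the whole file, and read the counters from `cs` when the call
returns 0. As noted in the DESCRIPTION, the counts may already be stale
by the time they are inspected.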
