Linux-Fsdevel Archive mirror
From: John Groves <John@Groves.net>
To: John Groves <John@Groves.net>, Jonathan Corbet <corbet@lwn.net>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Matthew Wilcox <willy@infradead.org>,
	linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	nvdimm@lists.linux.dev
Cc: John Groves <jgroves@micron.com>,
	john@jagalactic.com, Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	dave.hansen@linux.intel.com, gregory.price@memverge.com,
	Randy Dunlap <rdunlap@infradead.org>,
	Jerome Glisse <jglisse@google.com>,
	Aravind Ramesh <arramesh@micron.com>,
	Ajay Joshi <ajayjoshi@micron.com>,
	Eishan Mirakhur <emirakhur@micron.com>,
	Ravi Shankar <venkataravis@micron.com>,
	Srinivasulu Thanneeru <sthanneeru@micron.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Amir Goldstein <amir73il@gmail.com>,
	Chandan Babu R <chandanbabu@kernel.org>,
	Bagas Sanjaya <bagasdotme@gmail.com>,
	"Darrick J . Wong" <djwong@kernel.org>,
	Kent Overstreet <kent.overstreet@linux.dev>,
	Steve French <stfrench@microsoft.com>,
	Nathan Lynch <nathanl@linux.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Thomas Zimmermann <tzimmermann@suse.de>,
	Julien Panis <jpanis@baylibre.com>,
	Stanislav Fomichev <sdf@google.com>,
	Dongsheng Yang <dongsheng.yang@easystack.cn>,
	John Groves <john@groves.net>
Subject: [RFC PATCH v2 00/12] Introduce the famfs shared-memory file system
Date: Mon, 29 Apr 2024 12:04:16 -0500
Message-ID: <cover.1714409084.git.john@groves.net>

This patch set introduces famfs[1] - a special-purpose fs-dax file system
for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
CXL-specific in any way.

* Famfs creates a simple access method for storing and sharing data in
  sharable memory. The memory is exposed and accessed as memory-mappable
  dax files.
* Famfs supports multiple hosts mounting the same file system from the
  same memory (something existing fs-dax file systems don't do).
* A famfs file system can be created on a /dev/dax device in devdax mode,
  which rests on dax functionality added in patches 2-7 of this series.
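As a rough illustration of the access model in the list above, a user-space
consumer might map a famfs file with nothing more than open(2) and mmap(2).
This is a minimal sketch, not code from this series: the helper name
map_file_ro() is hypothetical, and the read-only mapping is just one of the
simple sharing patterns.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of famfs): map an existing file read-only.
 * Returns the mapping and stores its length in *len_out, or returns NULL
 * if the file cannot be opened, is empty, or cannot be mapped.
 */
const void *map_file_ro(const char *path, size_t *len_out)
{
	struct stat st;
	void *addr;
	int fd;

	assert(path != NULL && len_out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;

	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}

	/* MAP_SHARED: loads go straight to the file's backing memory. */
	addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */
	if (addr == MAP_FAILED)
		return NULL;

	*len_out = (size_t)st.st_size;
	return addr;
}
```

A caller would pass the path of a file under a famfs mount point and later
release the mapping with munmap(2). Nothing in the sketch is famfs-specific;
that is the point of exposing the memory as ordinary mappable files.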

The famfs kernel file system is part of the famfs framework; additional
components in user space[2] handle metadata and direct the famfs kernel
module to instantiate files that map to specific memory. The famfs user
space has documentation and a reasonably thorough test suite.

The famfs kernel module never accesses the shared memory directly (either
data or metadata). Because of this, shared memory managed by the famfs
framework does not create a RAS "blast radius" problem that could crash
or destabilize the kernel. Poison or timeouts in famfs memory
can be expected to kill apps via SIGBUS and cause mounts to be disabled
due to memory failure notifications.

Famfs does not attempt to solve concurrency or coherency problems for apps,
although it does solve these problems in regard to its own data structures.
Apps may encounter hard concurrency problems, but there are use cases that
are eminently useful and uncomplicated from a concurrency perspective:
serial sharing is one (only one host at a time has access), and read-only
concurrent sharing is another (all hosts can read-cache without worry).

Contents:

* famfs kernel documentation [patch 1]. Note that evolving famfs user
  documentation is at [2]
* dev_dax_iomap patchset [patches 2-7] - This enables fs-dax to use the
  iomap interface via a character /dev/dax device (e.g. /dev/dax0.0). For
  historical reasons the iomap infrastructure was enabled only for
  /dev/pmem devices (which are dax block devices). As famfs is the first
  fs-dax file system that works on /dev/dax, this patch series fills in
  the bare minimum infrastructure to enable iomap api usage with /dev/dax.
* famfs patchset [patches 8-12] - this introduces the kernel component of
  famfs.

Note that there is a developing consensus that /dev/dax requires
some fundamental re-factoring (e.g. [3]) that is related but outside the
scope of this series.

Some observations about using sharable memory

* It does not make sense to online sharable memory as system-ram.
  System-ram gets zeroed when it is onlined, so sharing is basically
  nonsense.
* It does not make sense to put struct page's in sharable memory, because
  those can't be shared. However, separately providing non-sharable
  capacity to be used for struct page's might be a sensible approach if the
  size of struct page array for sharable memory is too large to put in
  conventional system-ram (albeit with possible RAS implications).
* Sharable memory is pmem-like, in that a host is likely to connect in
  order to gain access to data that is already in the memory. Moreover
  the power domain for shared memory is separate from that of the server.
  Having observed that, famfs is not intended for persistent storage. It is
  intended for sharing data sets in memory during a time frame where the
  memory and the compute nodes are expected to remain operational - such
  as during a clustered data analytics job.

Could we do this with FUSE?

The key performance requirement for famfs is efficient handling of VMA
faults. This requires caching the complete dax extent lists for all active
files so faults can be handled without upcalls, which FUSE does not do.
It would probably be possible to put this capability into FUSE, but we think
that keeping famfs separate from FUSE is the simpler approach.
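
To make the extent-caching point concrete, here is a minimal user-space
sketch of resolving a file offset through a cached per-file extent list, with
no upcall involved. The names sketch_extent and sketch_resolve() are
hypothetical; they are not the data structures or functions in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: one contiguous dax device range backing part of a file. */
struct sketch_extent {
	uint64_t dax_offset;	/* byte offset of the range within the dax device */
	uint64_t len;		/* length of the range in bytes */
};

/*
 * Resolve a file offset to a dax device offset by walking the cached extent
 * list in file order. On success returns 0, stores the device offset in
 * *dax_off and the bytes remaining in that extent in *contig. Returns -1 if
 * file_off lies beyond the last extent.
 */
int sketch_resolve(const struct sketch_extent *ext, size_t nr_ext,
		   uint64_t file_off, uint64_t *dax_off, uint64_t *contig)
{
	uint64_t start = 0;	/* file offset at which the current extent begins */
	size_t i;

	assert(ext != NULL && dax_off != NULL && contig != NULL);

	for (i = 0; i < nr_ext; i++) {
		uint64_t within = file_off - start;	/* never underflows here */

		if (within < ext[i].len) {
			*dax_off = ext[i].dax_offset + within;
			*contig = ext[i].len - within;
			return 0;
		}
		start += ext[i].len;
	}
	return -1;
}
```

Because the whole list is already in memory, a fault handler built along
these lines needs only a short walk (or a search, for long lists) to find
the backing memory for a faulting offset.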

We will be discussing this topic at LSFMM 2024 [5] in a session called "Famfs:
new userspace filesystem driver vs. improving FUSE/DAX" - but other
famfs-related discussion will also be welcome!

This patch set is available as a branch at [6]

References

[1] https://lpc.events/event/17/contributions/1455/
[2] https://github.com/cxl-micron-reskit/famfs
[3] https://lore.kernel.org/all/166630293549.1017198.3833687373550679565.stgit@dwillia2-xfh.jf.intel.com/
[4] https://www.computeexpresslink.org/download-the-specification
[5] https://events.linuxfoundation.org/lsfmmbpf/program/schedule-at-a-glance/
[6] https://github.com/cxl-micron-reskit/famfs-linux/tree/famfs-v2


Changes since RFC v1:


* This patch series is a from-scratch refactor of the original. The code
  that maps a file to a dax device is almost identical, but a lot of
  cleanup has been done.
* The get_tree and backing device handling code has been ripped up and
  re-done (in the get-tree case, based on suggestions from Christian
  Brauner - thanks Christian; I hope I haven't done any new dumb stuff!)
  (Note this code has been extensively tested; after all known error cases
  famfs can be umounted and the module can be unloaded)
* Famfs now 'shuts down' if the dax device reports any memory errors. I/O
  and faults start reporting SIGBUS. Famfs detects memory errors via an
  iomap_ops->notify failure call from the devdax layer. This has been tested
  and appears to disable the famfs file system while leaving it able to
  dismount cleanly.
* Dropped fault counters
* Dropped support for symlinks within a famfs file system; we don't think
  supporting symlinks makes sense with famfs, and it has some undesirable
  side effects, so it's out.
* Dropped support for mknod within a famfs file system (other than regular
  files and directories)
* Famfs magic number moved to magic.h
* Famfs ioctl opcodes now documented in
  Documentation/userspace-api/ioctl/ioctl-number.rst
* Dodgy kerneldoc comments cleaned up or removed; hopefully none added...
* Kconfig formatting cleaned up
* Dropped /dev/pmem support. Prior patch series would mount on either
  /dev/pmem or /dev/dax devices. This is unnecessary complexity since
  /dev/pmem devices can be converted to /dev/dax. Famfs is, however, the
  first file system we know of that mounts from a character device.
* Famfs no longer does a filp_open() of the dax device. It finds the
  device by its dev_t and uses fs_dax_get() to effect exclusivity.
* Added a read-only module param famfs_kabi_version for checking
  that user space was compiled for the same ABI version
* The famfs kernel module (the code in fs/famfs plus the uapi file
  famfs_ioctl.h) dropped from 1030 lines of code in v1 to 760 in v2,
  according to "cloc".
* Fixed issues reported by the kernel test robot
* Many minor improvements in response to v1 code reviews
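
The famfs_kabi_version check mentioned in the list above could be done from
user space roughly as follows. This is a sketch under assumptions: that the
parameter is an integer, and that it appears at the usual sysfs location for
readable module parameters (/sys/module/<module>/parameters/<name>), with
the module named "famfs". The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read an integer module parameter from a sysfs-style file.
 * Returns 0 and stores the value in *version, or -1 on any error.
 */
int read_kabi_version(const char *param_path, int *version)
{
	FILE *f;
	int ok;

	assert(param_path != NULL && version != NULL);

	f = fopen(param_path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", version) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Returns 1 if the version reported by the running kernel module equals the
 * version user space was built for, and 0 otherwise (including on read errors).
 */
int kabi_matches(const char *param_path, int built_for)
{
	int running;

	if (read_kabi_version(param_path, &running) != 0)
		return 0;
	return running == built_for;
}
```

A tool would call kabi_matches() with the assumed path
/sys/module/famfs/parameters/famfs_kabi_version and its compiled-in version
number, and refuse to proceed on a mismatch.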


John Groves (12):
  famfs: Introduce famfs documentation
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  famfs prep: Add fs/super.c:kill_char_super()
  famfs: module operations & fs_context
  famfs: Introduce inode_operations and super_operations
  famfs: Introduce file_operations read/write
  famfs: Introduce mmap and VM fault handling
  famfs: famfs_ioctl and core file-to-memory mapping logic & iomap_ops

 Documentation/filesystems/famfs.rst           | 135 ++++
 Documentation/filesystems/index.rst           |   1 +
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |  11 +
 drivers/dax/Kconfig                           |   6 +
 drivers/dax/bus.c                             | 144 ++++-
 drivers/dax/dax-private.h                     |   1 +
 drivers/dax/device.c                          |  38 +-
 drivers/dax/super.c                           |  33 +-
 fs/Kconfig                                    |   2 +
 fs/Makefile                                   |   1 +
 fs/famfs/Kconfig                              |  10 +
 fs/famfs/Makefile                             |   5 +
 fs/famfs/famfs_file.c                         | 605 ++++++++++++++++++
 fs/famfs/famfs_inode.c                        | 452 +++++++++++++
 fs/famfs/famfs_internal.h                     |  52 ++
 fs/namei.c                                    |   1 +
 fs/super.c                                    |   9 +
 include/linux/dax.h                           |   6 +
 include/linux/fs.h                            |   1 +
 include/uapi/linux/famfs_ioctl.h              |  61 ++
 include/uapi/linux/magic.h                    |   1 +
 22 files changed, 1547 insertions(+), 29 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/famfs/Kconfig
 create mode 100644 fs/famfs/Makefile
 create mode 100644 fs/famfs/famfs_file.c
 create mode 100644 fs/famfs/famfs_inode.c
 create mode 100644 fs/famfs/famfs_internal.h
 create mode 100644 include/uapi/linux/famfs_ioctl.h


base-commit: ed30a4a51bb196781c8058073ea720133a65596f
-- 
2.43.0

