NVDIMM Device and Persistent Memory development
 help / color / mirror / Atom feed
* [PATCH RFC 0/4] dev_dax_iomap: introduce iomap for /dev/dax
       [not found] <20231206210252.52107-1-john@jagalactic.com>
@ 2023-12-06 21:02 ` John Groves
       [not found] ` <20231206210252.52107-2-john@jagalactic.com>
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:02 UTC (permalink / raw
  To: Dan Williams, John Groves, John Groves
  Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, John Groves

From: John Groves <john@groves.net>

This patch set is not intended to be merged; I'm hoping to get some
clarification as to the correct approach (especialy from Dan). 

This work is related to famfs, which is a dax file system for shared
fabric-attached memory (FAM). Famfs is "coming soon" as an RFC, but
the concept and requirements were presented at LPC 2023. See
https://lpc.events/event/17/contributions/1455/ and
https://www.youtube.com/watch?v=aA_DgO95gLo. My expectation is that
a future (fully working) version of this patch will be folded into the
famfs
patches.

Unlike the current set of fs-dax file systems, famfs does not need a block
(pmem) device, and should really run on a /dev/dax character device since
that's how sharable fabric-attached cxl memory will surface. But
/dev/dax character devices are missing some functionality that is provided
by the block /dev/pmem driver - specifically struct dax_operations pointer
in struct dax_device.

This patch, when CONFIG_DEV_DAX_IOMAP=y, populates dax_dev->ops for
character dax devices. The added operations differ (currently) from
/dev/pmem's dev_dax->ops in that they don't use memremap but instead
provide a physical address in response to the dev_dax->direct_access()
method. 

The dax_operations are direct_access() (which resolves a dax dev offset
to an address), zero_page_range() and recovery_write(). I'm not sure yet
how to test the latter two, but the direct_access() method works in
conjunciton with famfs - but only for mmaped files.

But Posix reads fail. Specifically dax_iomap_iter() calls
dax_copy_to_iter(), which declines to copy the data for some reason in
one of the lower level copy_to_user variants. I've tried to isolate the
reason for the failure with a VM under gdb, but no luck there yet. I'm
wondering if there is some flag or attribute that needs to be applied to
these addresses/pages somehow to allow this to work.

The failing copy_to_user() differs from the same path with pmem fs-dax,
in that pmem does a memremap (which I think generates one contiguous
range, even if the device had more than one range - is this right, and
does this mean it's consuming some of the vmap/vmalloc range?)

I spent some time attempting a memremap, but I haven't figured out the
magic for that. However, I like the simplicity of resolving to phys if
that's not a non-starter for some reason.

I hope this is enough context for a meaningful review and suggestions as to
what a working dev_dax->dax_operations implementation should look like.

Thanks for any tips!


John Groves (4):
  Add add_dax_ops() func for fs-dax to provide dax holder_ops
  Temporary hacks due to linkage issues
  Add dax_operations to /dev/dax struct dax_device
  Add CONFIG_DEV_DAX_IOMAP kernel build parameter

 drivers/dax/Kconfig |   6 ++
 drivers/dax/bus.c   | 155 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/dax/super.c |  16 +++++
 include/linux/dax.h |   5 ++
 4 files changed, 182 insertions(+)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH RFC 1/4] dev_dax_iomap: Add add_dax_ops() func for fs-dax to provide dax holder_ops
       [not found] ` <20231206210252.52107-2-john@jagalactic.com>
@ 2023-12-06 21:03   ` John Groves
  0 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:03 UTC (permalink / raw
  To: Dan Williams, John Groves, John Groves
  Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, John Groves

From: John Groves <john@groves.net>

This is clearly not the right way to set the holder_ops; where &
how should this be done?
---
 drivers/dax/super.c | 16 ++++++++++++++++
 include/linux/dax.h |  5 +++++
 2 files changed, 21 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..3d4e205c1854 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -467,6 +467,22 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+/* famfs calls this to add the holder_ops. There is probably a more elegant approach */
+int add_dax_ops(
+	struct dax_device           *dax_dev,
+	const struct dax_holder_operations *hops)
+{
+	/* dax_dev->ops should have been populated by devm_create_dev_dax() */
+	WARN_ON(!dax_dev->ops);
+
+	/* Use cmpxchg? */
+	dax_dev->holder_ops = hops;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(add_dax_ops);
+#endif /* DEV_DAX_IOMAP */
+
 void put_dax(struct dax_device *dax_dev)
 {
 	if (!dax_dev)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..31b68667b3cb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,6 +57,11 @@ struct dax_holder_operations {
 
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int add_dax_ops(struct dax_device *dax_dev,
+		const struct dax_holder_operations *hops);
+#endif
 void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH RFC 2/4] dev_dax_iomap: Temporary hacks due to linkage issues
       [not found] ` <20231206210252.52107-3-john@jagalactic.com>
@ 2023-12-06 21:03   ` John Groves
  0 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:03 UTC (permalink / raw
  To: Dan Williams, John Groves, John Groves
  Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, John Groves

From: John Groves <john@groves.net>

These are functions that should be called from outside, but I had
linkage issues and did this hack instead. Will fix in the "real"
patches...
---
 drivers/dax/bus.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1659b787b65f..1b55fd7aabaf 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1324,6 +1324,56 @@ static const struct device_type dev_dax_type = {
 	.groups = dax_attribute_groups,
 };
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+/*
+ * This is write_pmem() from pmem.c
+ */
+static void write_dax(void *pmem_addr, struct page *page,
+		unsigned int off, unsigned int len)
+{
+	unsigned int chunk;
+	void *mem;
+
+	while (len) {
+		mem = kmap_atomic(page);
+		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+		memcpy_flushcache(pmem_addr, mem + off, chunk);
+		kunmap_atomic(mem);
+		len -= chunk;
+		off = 0;
+		page++;
+		pmem_addr += chunk;
+	}
+}
+
+/*
+ * This function is from drivers/dax/device.c
+ * For some reason EXPORT_SYMBOL(dax_pgoff_to_phys) didn't result in linkable code
+ */
+phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+			      unsigned long size)
+{
+	int i;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+		struct range *range = &dax_range->range;
+		unsigned long long pgoff_end;
+		phys_addr_t phys;
+
+		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+			continue;
+		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+		if (phys + size - 1 <= range->end)
+			return phys;
+		break;
+	}
+	return -1;
+}
+
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH RFC 3/4] dev_dax_iomap: Add dax_operations to /dev/dax struct dax_device
       [not found] ` <20231206210252.52107-4-john@jagalactic.com>
@ 2023-12-06 21:03   ` John Groves
  0 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:03 UTC (permalink / raw
  To: Dan Williams, John Groves, John Groves
  Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, John Groves

From: John Groves <john@groves.net>

This is the primary content of this rfc. Notes about this commit:

* These methods are based somewhat loosely on pmem_dax_ops from
  drivers/nvdimm/pmem.c

* dev_dax_direct_access() is physaddr based

* dev_dax_direct_access() works for mmap, but fails dax_copy_to_iter()
  on posix read

* dev_dax_recovery_write()  and dev_dax_zero_page_range() have not been
  tested yet. I'm looking for suggestions as to how to test those.

I'm hoping somebody (Dan?) can point the way to getting this working
with posix I/O. Does this need to go the memremap route?

Thanks,
John
---
 drivers/dax/bus.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1b55fd7aabaf..8f8c2991c7c2 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -10,6 +10,12 @@
 #include "dax-private.h"
 #include "bus.h"
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+#include <linux/backing-dev.h>
+#include <linux/pfn_t.h>
+#include <linux/range.h>
+#endif
+
 static DEFINE_MUTEX(dax_bus_lock);
 
 #define DAX_NAME_LEN 30
@@ -1374,6 +1380,100 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 
 
+/* the phys address approach */
+long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+			     long nr_pages, enum dax_access_mode mode, void **kaddr,
+			     pfn_t *pfn)
+{
+	struct dev_dax *dev_dax = dax_get_private(dax_dev);
+	size_t size = nr_pages << PAGE_SHIFT;
+	size_t offset = pgoff << PAGE_SHIFT;
+	long range_remainder = 0;
+	phys_addr_t phys;
+	int i;
+
+	/*
+	 * pmem hides dax ranges by mapping  to a contiguous
+	 * pmem->virt_addr = devm_mremap_pages() (in pem_attach_disk()).
+	 * Is it legal to avoid the vmap overhead (and resource consumption) and just return
+	 * a (potentially partial) phys range? This function does this, returning the
+	 * phys_addr with the length truncated if necessary to the range remainder
+	 */
+	phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+	if (kaddr)
+		*kaddr = (void *)phys;
+
+	if (pfn)
+		*pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); /* are flags correct? */
+
+	/*
+	 * If dax_pgoff_to_phys() also returned the range remainder (range_len - range_offset)
+	 * this loop would not be necessary
+	 */
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		size_t rlen = range_len(&(dev_dax->ranges[i].range));
+
+		if (offset < rlen) {
+			range_remainder = rlen - offset;
+			break;
+		}
+		offset -= rlen;
+	}
+
+	/*
+	 * Return length valid at phys. Hoping callers can deal with len < entire_dax_device
+	 * (or < npages). This returns the remaining length in the applicable dax region.
+	 */
+	return PHYS_PFN(min_t(size_t, range_remainder, size));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+				    size_t nr_pages)
+{
+	long resid = nr_pages << PAGE_SHIFT;
+	long offset = pgoff << PAGE_SHIFT;
+
+	/* Break into one write per dax region */
+	while (resid > 0) {
+		void *kaddr;
+		pgoff_t poff = offset >> PAGE_SHIFT;
+		long len = __dev_dax_direct_access(dax_dev, poff,
+						   nr_pages, DAX_ACCESS, &kaddr, NULL);
+		len = min_t(long, len, PAGE_SIZE);
+		write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+		offset += len;
+		resid  -= len;
+	}
+	return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, pfn_t *pfn)
+{
+	return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	size_t len, off;
+
+	off = offset_in_page(addr);
+	len = PFN_PHYS(PFN_UP(off + bytes));
+
+	return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+	.direct_access = dev_dax_direct_access,
+	.zero_page_range = dev_dax_zero_page_range,
+	.recovery_write = dev_dax_recovery_write,
+};
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
@@ -1429,11 +1529,16 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 		}
 	}
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+	/* holder_ops currently populated separately in a slightly hacky way */
+	dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
+#else
 	/*
 	 * No dax_operations since there is no access to this device outside of
 	 * mmap of the resulting character device.
 	 */
 	dax_dev = alloc_dax(dev_dax, NULL);
+#endif
 	if (IS_ERR(dax_dev)) {
 		rc = PTR_ERR(dax_dev);
 		goto err_alloc_dax;
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH RFC 4/4] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
       [not found] ` <20231206210252.52107-5-john@jagalactic.com>
@ 2023-12-06 21:03   ` John Groves
  0 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:03 UTC (permalink / raw
  To: Dan Williams, John Groves, John Groves
  Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, John Groves

From: John Groves <john@groves.net>

Add CONFIG_DEV_DAXIOMAP kernel build parameter
---
 drivers/dax/Kconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..b1ebcc77120b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM
 
 	  Say N if unsure.
 
+config DEV_DAX_IOMAP
+       depends on DEV_DAX && DAX
+       def_bool y
+       help
+         Support iomap mapping of devdax devices (for FS-DAX file
+         systems that reside on character /dev/dax devices)
 endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-12-06 21:03 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20231206210252.52107-1-john@jagalactic.com>
2023-12-06 21:02 ` [PATCH RFC 0/4] dev_dax_iomap: introduce iomap for /dev/dax John Groves
     [not found] ` <20231206210252.52107-2-john@jagalactic.com>
2023-12-06 21:03   ` [PATCH RFC 1/4] dev_dax_iomap: Add add_dax_ops() func for fs-dax to provide dax holder_ops John Groves
     [not found] ` <20231206210252.52107-3-john@jagalactic.com>
2023-12-06 21:03   ` [PATCH RFC 2/4] dev_dax_iomap: Temporary hacks due to linkage issues John Groves
     [not found] ` <20231206210252.52107-4-john@jagalactic.com>
2023-12-06 21:03   ` [PATCH RFC 3/4] dev_dax_iomap: Add dax_operations to /dev/dax struct dax_device John Groves
     [not found] ` <20231206210252.52107-5-john@jagalactic.com>
2023-12-06 21:03   ` [PATCH RFC 4/4] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter John Groves

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).