* [PATCH RFC 0/4] dev_dax_iomap: introduce iomap for /dev/dax
[not found] <20231206210252.52107-1-john@jagalactic.com>
@ 2023-12-06 21:02 ` John Groves
[not found] ` <20231206210252.52107-2-john@jagalactic.com>
` (3 subsequent siblings)
4 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:02 UTC (permalink / raw
To: Dan Williams, John Groves, John Groves
Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, John Groves
From: John Groves <john@groves.net>
This patch set is not intended to be merged; I'm hoping to get some
clarification as to the correct approach (especialy from Dan).
This work is related to famfs, which is a dax file system for shared
fabric-attached memory (FAM). Famfs is "coming soon" as an RFC, but
the concept and requirements were presented at LPC 2023. See
https://lpc.events/event/17/contributions/1455/ and
https://www.youtube.com/watch?v=aA_DgO95gLo. My expectation is that
a future (fully working) version of this patch will be folded into the
famfs
patches.
Unlike the current set of fs-dax file systems, famfs does not need a block
(pmem) device, and should really run on a /dev/dax character device since
that's how sharable fabric-attached cxl memory will surface. But
/dev/dax character devices are missing some functionality that is provided
by the block /dev/pmem driver - specifically struct dax_operations pointer
in struct dax_device.
This patch, when CONFIG_DEV_DAX_IOMAP=y, populates dax_dev->ops for
character dax devices. The added operations differ (currently) from
/dev/pmem's dev_dax->ops in that they don't use memremap but instead
provide a physical address in response to the dev_dax->direct_access()
method.
The dax_operations are direct_access() (which resolves a dax dev offset
to an address), zero_page_range() and recovery_write(). I'm not sure yet
how to test the latter two, but the direct_access() method works in
conjunciton with famfs - but only for mmaped files.
But Posix reads fail. Specifically dax_iomap_iter() calls
dax_copy_to_iter(), which declines to copy the data for some reason in
one of the lower level copy_to_user variants. I've tried to isolate the
reason for the failure with a VM under gdb, but no luck there yet. I'm
wondering if there is some flag or attribute that needs to be applied to
these addresses/pages somehow to allow this to work.
The failing copy_to_user() differs from the same path with pmem fs-dax,
in that pmem does a memremap (which I think generates one contiguous
range, even if the device had more than one range - is this right, and
does this mean it's consuming some of the vmap/vmalloc range?)
I spent some time attempting a memremap, but I haven't figured out the
magic for that. However, I like the simplicity of resolving to phys if
that's not a non-starter for some reason.
I hope this is enough context for a meaningful review and suggestions as to
what a working dev_dax->dax_operations implementation should look like.
Thanks for any tips!
John Groves (4):
Add add_dax_ops() func for fs-dax to provide dax holder_ops
Temporary hacks due to linkage issues
Add dax_operations to /dev/dax struct dax_device
Add CONFIG_DEV_DAX_IOMAP kernel build parameter
drivers/dax/Kconfig | 6 ++
drivers/dax/bus.c | 155 ++++++++++++++++++++++++++++++++++++++++++++
drivers/dax/super.c | 16 +++++
include/linux/dax.h | 5 ++
4 files changed, 182 insertions(+)
--
2.40.1
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH RFC 1/4] dev_dax_iomap: Add add_dax_ops() func for fs-dax to provide dax holder_ops
[not found] ` <20231206210252.52107-2-john@jagalactic.com>
@ 2023-12-06 21:03 ` John Groves
0 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:03 UTC (permalink / raw
To: Dan Williams, John Groves, John Groves
Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, John Groves
From: John Groves <john@groves.net>
This is clearly not the right way to set the holder_ops; where &
how should this be done?
---
drivers/dax/super.c | 16 ++++++++++++++++
include/linux/dax.h | 5 +++++
2 files changed, 21 insertions(+)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..3d4e205c1854 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -467,6 +467,22 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
}
EXPORT_SYMBOL_GPL(alloc_dax);
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+/* famfs calls this to add the holder_ops. There is probably a more elegant approach */
+int add_dax_ops(
+ struct dax_device *dax_dev,
+ const struct dax_holder_operations *hops)
+{
+ /* dax_dev->ops should have been populated by devm_create_dev_dax() */
+ WARN_ON(!dax_dev->ops);
+
+ /* Use cmpxchg? */
+ dax_dev->holder_ops = hops;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(add_dax_ops);
+#endif /* DEV_DAX_IOMAP */
+
void put_dax(struct dax_device *dax_dev)
{
if (!dax_dev)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..31b68667b3cb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,6 +57,11 @@ struct dax_holder_operations {
#if IS_ENABLED(CONFIG_DAX)
struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int add_dax_ops(struct dax_device *dax_dev,
+ const struct dax_holder_operations *hops);
+#endif
void *dax_holder(struct dax_device *dax_dev);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
--
2.40.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH RFC 2/4] dev_dax_iomap: Temporary hacks due to linkage issues
[not found] ` <20231206210252.52107-3-john@jagalactic.com>
@ 2023-12-06 21:03 ` John Groves
0 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:03 UTC (permalink / raw
To: Dan Williams, John Groves, John Groves
Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, John Groves
From: John Groves <john@groves.net>
These are functions that should be called from outside, but I had
linkage issues and did this hack instead. Will fix in the "real"
patches...
---
drivers/dax/bus.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1659b787b65f..1b55fd7aabaf 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1324,6 +1324,56 @@ static const struct device_type dev_dax_type = {
.groups = dax_attribute_groups,
};
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+/*
+ * This is write_pmem() from pmem.c
+ */
+static void write_dax(void *pmem_addr, struct page *page,
+ unsigned int off, unsigned int len)
+{
+ unsigned int chunk;
+ void *mem;
+
+ while (len) {
+ mem = kmap_atomic(page);
+ chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+ memcpy_flushcache(pmem_addr, mem + off, chunk);
+ kunmap_atomic(mem);
+ len -= chunk;
+ off = 0;
+ page++;
+ pmem_addr += chunk;
+ }
+}
+
+/*
+ * This function is from drivers/dax/device.c
+ * For some reason EXPORT_SYMBOL(dax_pgoff_to_phys) didn't result in linkable code
+ */
+phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+ unsigned long size)
+{
+ int i;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+ struct range *range = &dax_range->range;
+ unsigned long long pgoff_end;
+ phys_addr_t phys;
+
+ pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+ if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+ continue;
+ phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+ if (phys + size - 1 <= range->end)
+ return phys;
+ break;
+ }
+ return -1;
+}
+
+
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
{
struct dax_region *dax_region = data->dax_region;
--
2.40.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH RFC 3/4] dev_dax_iomap: Add dax_operations to /dev/dax struct dax_device
[not found] ` <20231206210252.52107-4-john@jagalactic.com>
@ 2023-12-06 21:03 ` John Groves
0 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:03 UTC (permalink / raw
To: Dan Williams, John Groves, John Groves
Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, John Groves
From: John Groves <john@groves.net>
This is the primary content of this rfc. Notes about this commit:
* These methods are based somewhat loosely on pmem_dax_ops from
drivers/nvdimm/pmem.c
* dev_dax_direct_access() is physaddr based
* dev_dax_direct_access() works for mmap, but fails dax_copy_to_iter()
on posix read
* dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
tested yet. I'm looking for suggestions as to how to test those.
I'm hoping somebody (Dan?) can point the way to getting this working
with posix I/O. Does this need to go the memremap route?
Thanks,
John
---
drivers/dax/bus.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 105 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1b55fd7aabaf..8f8c2991c7c2 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -10,6 +10,12 @@
#include "dax-private.h"
#include "bus.h"
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+#include <linux/backing-dev.h>
+#include <linux/pfn_t.h>
+#include <linux/range.h>
+#endif
+
static DEFINE_MUTEX(dax_bus_lock);
#define DAX_NAME_LEN 30
@@ -1374,6 +1380,100 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
}
+/* the phys address approach */
+long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, enum dax_access_mode mode, void **kaddr,
+ pfn_t *pfn)
+{
+ struct dev_dax *dev_dax = dax_get_private(dax_dev);
+ size_t size = nr_pages << PAGE_SHIFT;
+ size_t offset = pgoff << PAGE_SHIFT;
+ long range_remainder = 0;
+ phys_addr_t phys;
+ int i;
+
+ /*
+ * pmem hides dax ranges by mapping to a contiguous
+ * pmem->virt_addr = devm_mremap_pages() (in pem_attach_disk()).
+ * Is it legal to avoid the vmap overhead (and resource consumption) and just return
+ * a (potentially partial) phys range? This function does this, returning the
+ * phys_addr with the length truncated if necessary to the range remainder
+ */
+ phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+ if (kaddr)
+ *kaddr = (void *)phys;
+
+ if (pfn)
+ *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); /* are flags correct? */
+
+ /*
+ * If dax_pgoff_to_phys() also returned the range remainder (range_len - range_offset)
+ * this loop would not be necessary
+ */
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ size_t rlen = range_len(&(dev_dax->ranges[i].range));
+
+ if (offset < rlen) {
+ range_remainder = rlen - offset;
+ break;
+ }
+ offset -= rlen;
+ }
+
+ /*
+ * Return length valid at phys. Hoping callers can deal with len < entire_dax_device
+ * (or < npages). This returns the remaining length in the applicable dax region.
+ */
+ return PHYS_PFN(min_t(size_t, range_remainder, size));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+ size_t nr_pages)
+{
+ long resid = nr_pages << PAGE_SHIFT;
+ long offset = pgoff << PAGE_SHIFT;
+
+ /* Break into one write per dax region */
+ while (resid > 0) {
+ void *kaddr;
+ pgoff_t poff = offset >> PAGE_SHIFT;
+ long len = __dev_dax_direct_access(dax_dev, poff,
+ nr_pages, DAX_ACCESS, &kaddr, NULL);
+ len = min_t(long, len, PAGE_SIZE);
+ write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+ offset += len;
+ resid -= len;
+ }
+ return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+ pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+ void **kaddr, pfn_t *pfn)
+{
+ return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ size_t len, off;
+
+ off = offset_in_page(addr);
+ len = PFN_PHYS(PFN_UP(off + bytes));
+
+ return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+ .direct_access = dev_dax_direct_access,
+ .zero_page_range = dev_dax_zero_page_range,
+ .recovery_write = dev_dax_recovery_write,
+};
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
{
struct dax_region *dax_region = data->dax_region;
@@ -1429,11 +1529,16 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
}
}
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+ /* holder_ops currently populated separately in a slightly hacky way */
+ dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
+#else
/*
* No dax_operations since there is no access to this device outside of
* mmap of the resulting character device.
*/
dax_dev = alloc_dax(dev_dax, NULL);
+#endif
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
goto err_alloc_dax;
--
2.40.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH RFC 4/4] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
[not found] ` <20231206210252.52107-5-john@jagalactic.com>
@ 2023-12-06 21:03 ` John Groves
0 siblings, 0 replies; 5+ messages in thread
From: John Groves @ 2023-12-06 21:03 UTC (permalink / raw
To: Dan Williams, John Groves, John Groves
Cc: Vishal Verma, Dave Jiang, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, John Groves
From: John Groves <john@groves.net>
Add CONFIG_DEV_DAXIOMAP kernel build parameter
---
drivers/dax/Kconfig | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..b1ebcc77120b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM
Say N if unsure.
+config DEV_DAX_IOMAP
+ depends on DEV_DAX && DAX
+ def_bool y
+ help
+ Support iomap mapping of devdax devices (for FS-DAX file
+ systems that reside on character /dev/dax devices)
endif
--
2.40.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-12-06 21:03 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20231206210252.52107-1-john@jagalactic.com>
2023-12-06 21:02 ` [PATCH RFC 0/4] dev_dax_iomap: introduce iomap for /dev/dax John Groves
[not found] ` <20231206210252.52107-2-john@jagalactic.com>
2023-12-06 21:03 ` [PATCH RFC 1/4] dev_dax_iomap: Add add_dax_ops() func for fs-dax to provide dax holder_ops John Groves
[not found] ` <20231206210252.52107-3-john@jagalactic.com>
2023-12-06 21:03 ` [PATCH RFC 2/4] dev_dax_iomap: Temporary hacks due to linkage issues John Groves
[not found] ` <20231206210252.52107-4-john@jagalactic.com>
2023-12-06 21:03 ` [PATCH RFC 3/4] dev_dax_iomap: Add dax_operations to /dev/dax struct dax_device John Groves
[not found] ` <20231206210252.52107-5-john@jagalactic.com>
2023-12-06 21:03 ` [PATCH RFC 4/4] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter John Groves
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).