* [PATCH v4 00/16] fanotify: add pre-content hooks
@ 2024-08-14 21:25 Josef Bacik
2024-08-14 21:25 ` [PATCH v4 01/16] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
` (16 more replies)
0 siblings, 17 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
v3->v4:
- Trying to send a final verson Friday at 5pm before you go on vacation is a
recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's
review.
- Reworked the file system helper so it's handling of fpin was a little less
silly, per Chinner's suggestion.
- Updated the return values to not or in VM_FAULT_RETRY, as we have a comment
in filemap_fault that says if VM_FAULT_ERROR is set we won't have
VM_FAULT_RETRY set.
v2->v3:
- Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
Amir's suggestion.
- Reworked the exported helper so the per-filesystem changes are much smaller,
per Amir's suggestion.
- Fixed the screwup for DAX writes per Chinner's suggestion.
- Added Christian's reviewed-by's where appropriate.
v1->v2:
- reworked the page fault logic based on Jan's suggestion and turned it into a
helper.
- Added 3 patches per-fs where we need to call the fsnotify helper from their
->fault handlers.
- Disabled readahead in the case that there's a pre-content watch in place.
- Disabled huge faults when there's a pre-content watch in place (entirely
because it's untested, theoretically it should be straightforward to do).
- Updated the command numbers.
- Addressed the random spelling/grammer mistakes that Jan pointed out.
- Addressed the other random nits from Jan.
--- Original email ---
Hello,
These are the patches for the bare bones pre-content fanotify support. The
majority of this work is Amir's, my contribution to this has solely been around
adding the page fault hooks, testing and validating everything. I'm sending it
because Amir is traveling a bunch, and I touched it last so I'm going to take
all the hate and he can take all the credit.
There is a PoC that I've been using to validate this work, you can find the git
repo here
https://github.com/josefbacik/remote-fetch
This consists of 3 different tools.
1. populate. This just creates all the stub files in the directory from the
source directory. Just run ./populate ~/linux ~/hsm-linux and it'll
recursively create all of the stub files and directories.
2. remote-fetch. This is the actual PoC, you just point it at the source and
destination directory and then you can do whatever. ./remote-fetch ~/linux
~/hsm-linux.
3. mmap-validate. This was to validate the pagefault thing, this is likely what
will be turned into the selftest with remote-fetch. It creates a file and
then you can validate the file matches the right pattern with both normal
reads and mmap. Normally I do something like
./mmap-validate create ~/src/foo
./populate ~/src ~/dst
./rmeote-fetch ~/src ~/dst
./mmap-validate validate ~/dst/foo
I did a bunch of testing, I also got some performance numbers. I copied a
kernel tree, and then did remote-fetch, and then make -j4
Normal
real 9m49.709s
user 28m11.372s
sys 4m57.304s
HSM
real 10m6.454s
user 29m10.517s
sys 5m2.617s
So ~17 seconds more to build with HSM. I then did a make mrproper on both trees
to see the size
[root@fedora ~]# du -hs /src/linux
1.6G /src/linux
[root@fedora ~]# du -hs dst
125M dst
This mirrors the sort of savings we've seen in production.
Meta has had these patches (minus the page fault patch) deployed in production
for almost a year with our own utility for doing on-demand package fetching.
The savings from this has been pretty significant.
The page-fault hooks are necessary for the last thing we need, which is
on-demand range fetching of executables. Some of our binaries are several gigs
large, having the ability to remote fetch them on demand is a huge win for us
not only with space savings, but with startup time of containers.
There will be tests for this going into LTP once we're satisfied with the
patches and they're on their way upstream. Thanks,
Josef
Amir Goldstein (8):
fsnotify: introduce pre-content permission event
fsnotify: generate pre-content permission event on open
fanotify: introduce FAN_PRE_ACCESS permission event
fanotify: introduce FAN_PRE_MODIFY permission event
fanotify: pass optional file access range in pre-content event
fanotify: rename a misnamed constant
fanotify: report file range info with pre-content events
fanotify: allow to set errno in FAN_DENY permission response
Josef Bacik (8):
fanotify: don't skip extra event info if no info_mode is set
fanotify: add a helper to check for pre content events
fanotify: disable readahead if we have pre-content watches
mm: don't allow huge faults for files with pre content watches
fsnotify: generate pre-content permission event on page fault
bcachefs: add pre-content fsnotify hook to fault
gfs2: add pre-content fsnotify hook to fault
xfs: add pre-content fsnotify hook for write faults
fs/bcachefs/fs-io-pagecache.c | 4 +
fs/gfs2/file.c | 4 +
fs/namei.c | 9 ++
fs/notify/fanotify/fanotify.c | 32 ++++++--
fs/notify/fanotify/fanotify.h | 20 +++++
fs/notify/fanotify/fanotify_user.c | 116 +++++++++++++++++++++-----
fs/notify/fsnotify.c | 14 +++-
fs/xfs/xfs_file.c | 4 +
include/linux/fanotify.h | 20 +++--
include/linux/fsnotify.h | 54 ++++++++++--
include/linux/fsnotify_backend.h | 59 ++++++++++++-
include/linux/mm.h | 1 +
include/uapi/linux/fanotify.h | 17 ++++
mm/filemap.c | 128 +++++++++++++++++++++++++++--
mm/memory.c | 22 +++++
mm/readahead.c | 13 +++
security/selinux/hooks.c | 3 +-
17 files changed, 469 insertions(+), 51 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v4 01/16] fanotify: don't skip extra event info if no info_mode is set
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 02/16] fsnotify: introduce pre-content permission event Josef Bacik
` (15 subsequent siblings)
16 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
New pre-content events will be path events but they will also carry
additional range information. Remove the optimization to skip checking
whether info structures need to be generated for path events. This
results in no change in generated info structures for existing events.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/notify/fanotify/fanotify_user.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9ec313e9f6e1..2e2fba8a9d20 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -160,9 +160,6 @@ static size_t fanotify_event_len(unsigned int info_mode,
int fh_len;
int dot_len = 0;
- if (!info_mode)
- return event_len;
-
if (fanotify_is_error_event(event->mask))
event_len += FANOTIFY_ERROR_INFO_LEN;
@@ -740,12 +737,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
if (fanotify_is_perm_event(event->mask))
FANOTIFY_PERM(event)->fd = fd;
- if (info_mode) {
- ret = copy_info_records_to_user(event, info, info_mode, pidfd,
- buf, count);
- if (ret < 0)
- goto out_close_fd;
- }
+ ret = copy_info_records_to_user(event, info, info_mode, pidfd,
+ buf, count);
+ if (ret < 0)
+ goto out_close_fd;
if (f)
fd_install(fd, f);
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 02/16] fsnotify: introduce pre-content permission event
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
2024-08-14 21:25 ` [PATCH v4 01/16] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 03/16] fsnotify: generate pre-content permission event on open Josef Bacik
` (14 subsequent siblings)
16 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
From: Amir Goldstein <amir73il@gmail.com>
The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM,
but it meant for a different use case of filling file content before
access to a file range, so it has slightly different semantics.
Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, same as
we did for FS_OPEN_PERM/FS_OPEN_EXEC_PERM.
FS_PRE_MODIFY is a new permission event, with similar semantics as
FS_PRE_ACCESS, which is called before a file is modified.
FS_ACCESS_PERM is reported also on blockdev and pipes, but the new
pre-content events are only reported for regular files and dirs.
The pre-content events are meant to be used by hierarchical storage
managers that want to fill the content of files on first access.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fsnotify.c | 2 +-
include/linux/fsnotify.h | 27 ++++++++++++++++++++++++---
include/linux/fsnotify_backend.h | 13 +++++++++++--
security/selinux/hooks.c | 3 ++-
4 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 272c8a1dab3c..1ca4a8da7f29 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -621,7 +621,7 @@ static __init int fsnotify_init(void)
{
int ret;
- BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 23);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 25);
ret = init_srcu_struct(&fsnotify_mark_srcu);
if (ret)
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 278620e063ab..7600a0c045ba 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -133,12 +133,13 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
/*
- * fsnotify_file_area_perm - permission hook before access to file range
+ * fsnotify_file_area_perm - permission hook before access/modify of file range
*/
static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
const loff_t *ppos, size_t count)
{
- __u32 fsnotify_mask = FS_ACCESS_PERM;
+ struct inode *inode = file_inode(file);
+ __u32 fsnotify_mask;
/*
* filesystem may be modified in the context of permission events
@@ -147,7 +148,27 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
*/
lockdep_assert_once(file_write_not_started(file));
- if (!(perm_mask & MAY_READ))
+ /*
+ * Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events.
+ */
+ if (perm_mask & MAY_READ) {
+ int ret = fsnotify_file(file, FS_ACCESS_PERM);
+
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Pre-content events are only reported for regular files and dirs.
+ */
+ if (!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode))
+ return 0;
+
+ if (perm_mask & MAY_WRITE)
+ fsnotify_mask = FS_PRE_MODIFY;
+ else if (perm_mask & (MAY_READ | MAY_ACCESS))
+ fsnotify_mask = FS_PRE_ACCESS;
+ else
return 0;
return fsnotify_file(file, fsnotify_mask);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 8be029bc50b1..200a5e3b1cd4 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -56,6 +56,9 @@
#define FS_ACCESS_PERM 0x00020000 /* access event in a permissions hook */
#define FS_OPEN_EXEC_PERM 0x00040000 /* open/exec event in a permission hook */
+#define FS_PRE_ACCESS 0x00080000 /* Pre-content access hook */
+#define FS_PRE_MODIFY 0x00100000 /* Pre-content modify hook */
+
/*
* Set on inode mark that cares about things that happen to its children.
* Always set for dnotify and inotify.
@@ -77,8 +80,14 @@
*/
#define ALL_FSNOTIFY_DIRENT_EVENTS (FS_CREATE | FS_DELETE | FS_MOVE | FS_RENAME)
-#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
- FS_OPEN_EXEC_PERM)
+/* Content events can be used to inspect file content */
+#define FSNOTIFY_CONTENT_PERM_EVENTS (FS_OPEN_PERM | FS_OPEN_EXEC_PERM | \
+ FS_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FSNOTIFY_PRE_CONTENT_EVENTS (FS_PRE_ACCESS | FS_PRE_MODIFY)
+
+#define ALL_FSNOTIFY_PERM_EVENTS (FSNOTIFY_CONTENT_PERM_EVENTS | \
+ FSNOTIFY_PRE_CONTENT_EVENTS)
/*
* This is a list of all events that may get sent to a parent that is watching
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 55c78c318ccd..2997edf3e7cd 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3406,7 +3406,8 @@ static int selinux_path_notify(const struct path *path, u64 mask,
perm |= FILE__WATCH_WITH_PERM;
/* watches on read-like events need the file:watch_reads permission */
- if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_CLOSE_NOWRITE))
+ if (mask & (FS_ACCESS | FS_ACCESS_PERM | FS_PRE_ACCESS |
+ FS_CLOSE_NOWRITE))
perm |= FILE__WATCH_READS;
return path_has_perm(current_cred(), path, perm);
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 03/16] fsnotify: generate pre-content permission event on open
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
2024-08-14 21:25 ` [PATCH v4 01/16] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
2024-08-14 21:25 ` [PATCH v4 02/16] fsnotify: introduce pre-content permission event Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 04/16] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
` (13 subsequent siblings)
16 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
From: Amir Goldstein <amir73il@gmail.com>
FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on open depending on
file open mode. The pre-content event will be generated in addition to
FS_OPEN_PERM, but without sb_writers held and after file was truncated
in case file was opened with O_CREAT and/or O_TRUNC.
The event will have a range info of (0..0) to provide an opportunity
to fill entire file content on open.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
---
fs/namei.c | 9 +++++++++
include/linux/fsnotify.h | 10 +++++++++-
2 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index 3a4c40e12f78..c16487e3742d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3735,6 +3735,15 @@ static int do_open(struct nameidata *nd,
}
if (do_truncate)
mnt_drop_write(nd->path.mnt);
+
+ /*
+ * This permission hook is different than fsnotify_open_perm() hook.
+ * This is a pre-content hook that is called without sb_writers held
+ * and after the file was truncated.
+ */
+ if (!error)
+ error = fsnotify_file_perm(file, MAY_OPEN);
+
return error;
}
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 7600a0c045ba..fb3837b8de4c 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -168,6 +168,10 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
fsnotify_mask = FS_PRE_MODIFY;
else if (perm_mask & (MAY_READ | MAY_ACCESS))
fsnotify_mask = FS_PRE_ACCESS;
+ else if (perm_mask & MAY_OPEN && file->f_mode & FMODE_WRITER)
+ fsnotify_mask = FS_PRE_MODIFY;
+ else if (perm_mask & MAY_OPEN)
+ fsnotify_mask = FS_PRE_ACCESS;
else
return 0;
@@ -176,10 +180,14 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
/*
* fsnotify_file_perm - permission hook before file access
+ *
+ * Called from read()/write() with perm_mask MAY_READ/MAY_WRITE.
+ * Called from open() with MAY_OPEN without sb_writers held and after the file
+ * was truncated. Note that this is a different event from fsnotify_open_perm().
*/
static inline int fsnotify_file_perm(struct file *file, int perm_mask)
{
- return fsnotify_file_area_perm(file, perm_mask, NULL, 0);
+ return fsnotify_file_area_perm(file, perm_mask, &file->f_pos, 0);
}
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 04/16] fanotify: introduce FAN_PRE_ACCESS permission event
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (2 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 03/16] fsnotify: generate pre-content permission event on open Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 05/16] fanotify: introduce FAN_PRE_MODIFY " Josef Bacik
` (12 subsequent siblings)
16 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
From: Amir Goldstein <amir73il@gmail.com>
Similar to FAN_ACCESS_PERM permission event, but it is only allowed with
class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs.
Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed
in the context of the event handler.
This pre-content event is meant to be used by hierarchical storage
managers that want to fill the content of files on first read access.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
---
fs/notify/fanotify/fanotify.c | 3 ++-
fs/notify/fanotify/fanotify_user.c | 17 ++++++++++++++---
include/linux/fanotify.h | 14 ++++++++++----
include/uapi/linux/fanotify.h | 2 ++
4 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 224bccaab4cc..7dac8e4486df 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -910,8 +910,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
+ BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
- BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 21);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 22);
mask = fanotify_group_event_mask(group, iter_info, &match_mask,
mask, data, data_type, dir);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 2e2fba8a9d20..c294849e474f 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1628,6 +1628,7 @@ static int fanotify_events_supported(struct fsnotify_group *group,
unsigned int flags)
{
unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
+ bool is_dir = d_is_dir(path->dentry);
/* Strict validation of events in non-dir inode mask with v5.17+ APIs */
bool strict_dir_events = FAN_GROUP_FLAG(group, FAN_REPORT_TARGET_FID) ||
(mask & FAN_RENAME) ||
@@ -1665,9 +1666,15 @@ static int fanotify_events_supported(struct fsnotify_group *group,
* but because we always allowed it, error only when using new APIs.
*/
if (strict_dir_events && mark_type == FAN_MARK_INODE &&
- !d_is_dir(path->dentry) && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
+ !is_dir && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
return -ENOTDIR;
+ /* Pre-content events are only supported on regular files and dirs */
+ if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
+ if (!is_dir && !d_is_reg(path->dentry))
+ return -EINVAL;
+ }
+
return 0;
}
@@ -1769,11 +1776,15 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
goto fput_and_out;
/*
- * Permission events require minimum priority FAN_CLASS_CONTENT.
+ * Permission events are not allowed for FAN_CLASS_NOTIF.
+ * Pre-content permission events are not allowed for FAN_CLASS_CONTENT.
*/
ret = -EINVAL;
if (mask & FANOTIFY_PERM_EVENTS &&
- group->priority < FSNOTIFY_PRIO_CONTENT)
+ group->priority == FSNOTIFY_PRIO_NORMAL)
+ goto fput_and_out;
+ else if (mask & FANOTIFY_PRE_CONTENT_EVENTS &&
+ group->priority == FSNOTIFY_PRIO_CONTENT)
goto fput_and_out;
if (mask & FAN_FS_ERROR &&
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 4f1c4f603118..5c811baf44d2 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -88,6 +88,16 @@
#define FANOTIFY_DIRENT_EVENTS (FAN_MOVE | FAN_CREATE | FAN_DELETE | \
FAN_RENAME)
+/* Content events can be used to inspect file content */
+#define FANOTIFY_CONTENT_PERM_EVENTS (FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | \
+ FAN_ACCESS_PERM)
+/* Pre-content events can be used to fill file content */
+#define FANOTIFY_PRE_CONTENT_EVENTS (FAN_PRE_ACCESS)
+
+/* Events that require a permission response from user */
+#define FANOTIFY_PERM_EVENTS (FANOTIFY_CONTENT_PERM_EVENTS | \
+ FANOTIFY_PRE_CONTENT_EVENTS)
+
/* Events that can be reported with event->fd */
#define FANOTIFY_FD_EVENTS (FANOTIFY_PATH_EVENTS | FANOTIFY_PERM_EVENTS)
@@ -103,10 +113,6 @@
FANOTIFY_INODE_EVENTS | \
FANOTIFY_ERROR_EVENTS)
-/* Events that require a permission response from user */
-#define FANOTIFY_PERM_EVENTS (FAN_OPEN_PERM | FAN_ACCESS_PERM | \
- FAN_OPEN_EXEC_PERM)
-
/* Extra flags that may be reported with event or control handling of events */
#define FANOTIFY_EVENT_FLAGS (FAN_EVENT_ON_CHILD | FAN_ONDIR)
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index a37de58ca571..bcada21a3a2e 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -26,6 +26,8 @@
#define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */
#define FAN_OPEN_EXEC_PERM 0x00040000 /* File open/exec in perm check */
+#define FAN_PRE_ACCESS 0x00080000 /* Pre-content access hook */
+
#define FAN_EVENT_ON_CHILD 0x08000000 /* Interested in child events */
#define FAN_RENAME 0x10000000 /* File was renamed */
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 05/16] fanotify: introduce FAN_PRE_MODIFY permission event
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (3 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 04/16] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 06/16] fanotify: pass optional file access range in pre-content event Josef Bacik
` (11 subsequent siblings)
16 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
From: Amir Goldstein <amir73il@gmail.com>
Generate FAN_PRE_MODIFY permission event from fsnotify_file_perm()
pre-write hook to notify fanotify listeners on an intent to make
modification to a file.
Like FAN_PRE_ACCESS, it is only allowed with FAN_CLASS_PRE_CONTENT
and unlike FAN_MODIFY, it is only allowed on regular files.
Like FAN_PRE_ACCESS, it is generated without sb_start_write() held,
so it is safe to perform filesystem modifications in the context of
event handler.
This pre-content event is meant to be used by hierarchical storage
managers that want to fill the content of files on first write access.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
---
fs/notify/fanotify/fanotify.c | 3 ++-
fs/notify/fanotify/fanotify_user.c | 2 ++
include/linux/fanotify.h | 3 ++-
include/uapi/linux/fanotify.h | 1 +
4 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 7dac8e4486df..b163594843f5 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -911,8 +911,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
BUILD_BUG_ON(FAN_RENAME != FS_RENAME);
BUILD_BUG_ON(FAN_PRE_ACCESS != FS_PRE_ACCESS);
+ BUILD_BUG_ON(FAN_PRE_MODIFY != FS_PRE_MODIFY);
- BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 22);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 23);
mask = fanotify_group_event_mask(group, iter_info, &match_mask,
mask, data, data_type, dir);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index c294849e474f..3a7101544f30 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1673,6 +1673,8 @@ static int fanotify_events_supported(struct fsnotify_group *group,
if (mask & FANOTIFY_PRE_CONTENT_EVENTS) {
if (!is_dir && !d_is_reg(path->dentry))
return -EINVAL;
+ if (is_dir && mask & FAN_PRE_MODIFY)
+ return -EISDIR;
}
return 0;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 5c811baf44d2..ae6cb2688d52 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -92,7 +92,8 @@
#define FANOTIFY_CONTENT_PERM_EVENTS (FAN_OPEN_PERM | FAN_OPEN_EXEC_PERM | \
FAN_ACCESS_PERM)
/* Pre-content events can be used to fill file content */
-#define FANOTIFY_PRE_CONTENT_EVENTS (FAN_PRE_ACCESS)
+#define FANOTIFY_PRE_CONTENT_EVENTS (FAN_PRE_ACCESS | FAN_PRE_MODIFY)
+#define FANOTIFY_PRE_MODIFY_EVENTS (FAN_PRE_MODIFY)
/* Events that require a permission response from user */
#define FANOTIFY_PERM_EVENTS (FANOTIFY_CONTENT_PERM_EVENTS | \
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index bcada21a3a2e..ac00fad66416 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -27,6 +27,7 @@
#define FAN_OPEN_EXEC_PERM 0x00040000 /* File open/exec in perm check */
#define FAN_PRE_ACCESS 0x00080000 /* Pre-content access hook */
+#define FAN_PRE_MODIFY 0x00100000 /* Pre-content modify hook */
#define FAN_EVENT_ON_CHILD 0x08000000 /* Interested in child events */
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 06/16] fanotify: pass optional file access range in pre-content event
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (4 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 05/16] fanotify: introduce FAN_PRE_MODIFY " Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 07/16] fanotify: rename a misnamed constant Josef Bacik
` (10 subsequent siblings)
16 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
From: Amir Goldstein <amir73il@gmail.com>
We would like to add file range information to pre-content events.
Pass a struct file_range with optional offset and length to event handler
along with pre-content permission event.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.c | 10 ++++++++--
fs/notify/fanotify/fanotify.h | 2 ++
include/linux/fsnotify.h | 17 ++++++++++++++++-
include/linux/fsnotify_backend.h | 32 ++++++++++++++++++++++++++++++++
4 files changed, 58 insertions(+), 3 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index b163594843f5..4e8dce39fa8f 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -549,9 +549,13 @@ static struct fanotify_event *fanotify_alloc_path_event(const struct path *path,
return &pevent->fae;
}
-static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
+static struct fanotify_event *fanotify_alloc_perm_event(const void *data,
+ int data_type,
gfp_t gfp)
{
+ const struct path *path = fsnotify_data_path(data, data_type);
+ const struct file_range *range =
+ fsnotify_data_file_range(data, data_type);
struct fanotify_perm_event *pevent;
pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp);
@@ -565,6 +569,8 @@ static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
pevent->hdr.len = 0;
pevent->state = FAN_EVENT_INIT;
pevent->path = *path;
+ pevent->ppos = range ? range->ppos : NULL;
+ pevent->count = range ? range->count : 0;
path_get(path);
return &pevent->fae;
@@ -802,7 +808,7 @@ static struct fanotify_event *fanotify_alloc_event(
old_memcg = set_active_memcg(group->memcg);
if (fanotify_is_perm_event(mask)) {
- event = fanotify_alloc_perm_event(path, gfp);
+ event = fanotify_alloc_perm_event(data, data_type, gfp);
} else if (fanotify_is_error_event(mask)) {
event = fanotify_alloc_error_event(group, fsid, data,
data_type, &hash);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index e5ab33cae6a7..93598b7d5952 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -425,6 +425,8 @@ FANOTIFY_PE(struct fanotify_event *event)
struct fanotify_perm_event {
struct fanotify_event fae;
struct path path;
+ const loff_t *ppos; /* optional file range info */
+ size_t count;
u32 response; /* userspace answer to the event */
unsigned short state; /* state of the event */
int fd; /* fd we passed to userspace for this event */
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index fb3837b8de4c..9d001d328619 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -132,6 +132,21 @@ static inline int fsnotify_file(struct file *file, __u32 mask)
}
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+static inline int fsnotify_file_range(struct file *file, __u32 mask,
+ const loff_t *ppos, size_t count)
+{
+ struct file_range range;
+
+ if (file->f_mode & FMODE_NONOTIFY)
+ return 0;
+
+ range.path = &file->f_path;
+ range.ppos = ppos;
+ range.count = count;
+ return fsnotify_parent(range.path->dentry, mask, &range,
+ FSNOTIFY_EVENT_FILE_RANGE);
+}
+
/*
* fsnotify_file_area_perm - permission hook before access/modify of file range
*/
@@ -175,7 +190,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask,
else
return 0;
- return fsnotify_file(file, fsnotify_mask);
+ return fsnotify_file_range(file, fsnotify_mask, ppos, count);
}
/*
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 200a5e3b1cd4..276320846bfd 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -298,6 +298,7 @@ static inline void fsnotify_group_assert_locked(struct fsnotify_group *group)
/* When calling fsnotify tell it if the data is a path or inode */
enum fsnotify_data_type {
FSNOTIFY_EVENT_NONE,
+ FSNOTIFY_EVENT_FILE_RANGE,
FSNOTIFY_EVENT_PATH,
FSNOTIFY_EVENT_INODE,
FSNOTIFY_EVENT_DENTRY,
@@ -310,6 +311,17 @@ struct fs_error_report {
struct super_block *sb;
};
+struct file_range {
+ const struct path *path;
+ const loff_t *ppos;
+ size_t count;
+};
+
+static inline const struct path *file_range_path(const struct file_range *range)
+{
+ return range->path;
+}
+
static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
{
switch (data_type) {
@@ -319,6 +331,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
return d_inode(data);
case FSNOTIFY_EVENT_PATH:
return d_inode(((const struct path *)data)->dentry);
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return d_inode(file_range_path(data)->dentry);
case FSNOTIFY_EVENT_ERROR:
return ((struct fs_error_report *)data)->inode;
default:
@@ -334,6 +348,8 @@ static inline struct dentry *fsnotify_data_dentry(const void *data, int data_typ
return (struct dentry *)data;
case FSNOTIFY_EVENT_PATH:
return ((const struct path *)data)->dentry;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data)->dentry;
default:
return NULL;
}
@@ -345,6 +361,8 @@ static inline const struct path *fsnotify_data_path(const void *data,
switch (data_type) {
case FSNOTIFY_EVENT_PATH:
return data;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data);
default:
return NULL;
}
@@ -360,6 +378,8 @@ static inline struct super_block *fsnotify_data_sb(const void *data,
return ((struct dentry *)data)->d_sb;
case FSNOTIFY_EVENT_PATH:
return ((const struct path *)data)->dentry->d_sb;
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return file_range_path(data)->dentry->d_sb;
case FSNOTIFY_EVENT_ERROR:
return ((struct fs_error_report *) data)->sb;
default:
@@ -379,6 +399,18 @@ static inline struct fs_error_report *fsnotify_data_error_report(
}
}
+static inline const struct file_range *fsnotify_data_file_range(
+ const void *data,
+ int data_type)
+{
+ switch (data_type) {
+ case FSNOTIFY_EVENT_FILE_RANGE:
+ return (struct file_range *)data;
+ default:
+ return NULL;
+ }
+}
+
/*
* Index to merged marks iterator array that correlates to a type of watch.
* The type of watched object can be deduced from the iterator type, but not
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 07/16] fanotify: rename a misnamed constant
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (5 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 06/16] fanotify: pass optional file access range in pre-content event Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 08/16] fanotify: report file range info with pre-content events Josef Bacik
` (9 subsequent siblings)
16 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
From: Amir Goldstein <amir73il@gmail.com>
FANOTIFY_PIDFD_INFO_HDR_LEN is not the length of the header.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
---
fs/notify/fanotify/fanotify_user.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 3a7101544f30..5ece186d5c50 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -119,7 +119,7 @@ struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;
#define FANOTIFY_EVENT_ALIGN 4
#define FANOTIFY_FID_INFO_HDR_LEN \
(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
-#define FANOTIFY_PIDFD_INFO_HDR_LEN \
+#define FANOTIFY_PIDFD_INFO_LEN \
sizeof(struct fanotify_event_info_pidfd)
#define FANOTIFY_ERROR_INFO_LEN \
(sizeof(struct fanotify_event_info_error))
@@ -174,14 +174,14 @@ static size_t fanotify_event_len(unsigned int info_mode,
dot_len = 1;
}
- if (info_mode & FAN_REPORT_PIDFD)
- event_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
-
if (fanotify_event_has_object_fh(event)) {
fh_len = fanotify_event_object_fh_len(event);
event_len += fanotify_fid_info_len(fh_len, dot_len);
}
+ if (info_mode & FAN_REPORT_PIDFD)
+ event_len += FANOTIFY_PIDFD_INFO_LEN;
+
return event_len;
}
@@ -511,7 +511,7 @@ static int copy_pidfd_info_to_user(int pidfd,
size_t count)
{
struct fanotify_event_info_pidfd info = { };
- size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
+ size_t info_len = FANOTIFY_PIDFD_INFO_LEN;
if (WARN_ON_ONCE(info_len > count))
return -EFAULT;
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 08/16] fanotify: report file range info with pre-content events
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (6 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 07/16] fanotify: rename a misnamed constant Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-29 10:13 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 09/16] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
` (8 subsequent siblings)
16 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
From: Amir Goldstein <amir73il@gmail.com>
With group class FAN_CLASS_PRE_CONTENT, report offset and length info
along with FAN_PRE_ACCESS and FAN_PRE_MODIFY permission events.
This information is meant to be used by hierarchical storage managers
that want to fill partial content of files on first access to range.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.h | 8 +++++++
fs/notify/fanotify/fanotify_user.c | 38 ++++++++++++++++++++++++++++++
include/uapi/linux/fanotify.h | 7 ++++++
3 files changed, 53 insertions(+)
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 93598b7d5952..7f06355afa1f 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -448,6 +448,14 @@ static inline bool fanotify_is_perm_event(u32 mask)
mask & FANOTIFY_PERM_EVENTS;
}
+static inline bool fanotify_event_has_access_range(struct fanotify_event *event)
+{
+ if (!(event->mask & FANOTIFY_PRE_CONTENT_EVENTS))
+ return false;
+
+ return FANOTIFY_PERM(event)->ppos;
+}
+
static inline struct fanotify_event *FANOTIFY_E(struct fsnotify_event *fse)
{
return container_of(fse, struct fanotify_event, fse);
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 5ece186d5c50..ed56fe6f5ec7 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -123,6 +123,8 @@ struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;
sizeof(struct fanotify_event_info_pidfd)
#define FANOTIFY_ERROR_INFO_LEN \
(sizeof(struct fanotify_event_info_error))
+#define FANOTIFY_RANGE_INFO_LEN \
+ (sizeof(struct fanotify_event_info_range))
static int fanotify_fid_info_len(int fh_len, int name_len)
{
@@ -182,6 +184,9 @@ static size_t fanotify_event_len(unsigned int info_mode,
if (info_mode & FAN_REPORT_PIDFD)
event_len += FANOTIFY_PIDFD_INFO_LEN;
+ if (fanotify_event_has_access_range(event))
+ event_len += FANOTIFY_RANGE_INFO_LEN;
+
return event_len;
}
@@ -526,6 +531,30 @@ static int copy_pidfd_info_to_user(int pidfd,
return info_len;
}
+static size_t copy_range_info_to_user(struct fanotify_event *event,
+ char __user *buf, int count)
+{
+ struct fanotify_perm_event *pevent = FANOTIFY_PERM(event);
+ struct fanotify_event_info_range info = { };
+ size_t info_len = FANOTIFY_RANGE_INFO_LEN;
+
+ if (WARN_ON_ONCE(info_len > count))
+ return -EFAULT;
+
+ if (WARN_ON_ONCE(!pevent->ppos))
+ return -EINVAL;
+
+ info.hdr.info_type = FAN_EVENT_INFO_TYPE_RANGE;
+ info.hdr.len = info_len;
+ info.offset = *(pevent->ppos);
+ info.count = pevent->count;
+
+ if (copy_to_user(buf, &info, info_len))
+ return -EFAULT;
+
+ return info_len;
+}
+
static int copy_info_records_to_user(struct fanotify_event *event,
struct fanotify_info *info,
unsigned int info_mode, int pidfd,
@@ -647,6 +676,15 @@ static int copy_info_records_to_user(struct fanotify_event *event,
total_bytes += ret;
}
+ if (fanotify_event_has_access_range(event)) {
+ ret = copy_range_info_to_user(event, buf, count);
+ if (ret < 0)
+ return ret;
+ buf += ret;
+ count -= ret;
+ total_bytes += ret;
+ }
+
return total_bytes;
}
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index ac00fad66416..cc28dce5f744 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -145,6 +145,7 @@ struct fanotify_event_metadata {
#define FAN_EVENT_INFO_TYPE_DFID 3
#define FAN_EVENT_INFO_TYPE_PIDFD 4
#define FAN_EVENT_INFO_TYPE_ERROR 5
+#define FAN_EVENT_INFO_TYPE_RANGE 6
/* Special info types for FAN_RENAME */
#define FAN_EVENT_INFO_TYPE_OLD_DFID_NAME 10
@@ -191,6 +192,12 @@ struct fanotify_event_info_error {
__u32 error_count;
};
+struct fanotify_event_info_range {
+ struct fanotify_event_info_header hdr;
+ __u64 offset;
+ __u64 count;
+};
+
/*
* User space may need to record additional information about its decision.
* The extra information type records what kind of information is included.
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 09/16] fanotify: allow to set errno in FAN_DENY permission response
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (7 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 08/16] fanotify: report file range info with pre-content events Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-29 10:25 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 10/16] fanotify: add a helper to check for pre content events Josef Bacik
` (7 subsequent siblings)
16 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
From: Amir Goldstein <amir73il@gmail.com>
With FAN_DENY response, user trying to perform the filesystem operation
gets an error with errno set to EPERM.
It is useful for hierarchical storage management (HSM) service to be able
to deny access for reasons more diverse than EPERM, for example EAGAIN,
if HSM could retry the operation later.
Allow fanotify groups with priority FAN_CLASSS_PRE_CONTENT to responsd
to permission events with the response value FAN_DENY_ERRNO(errno),
instead of FAN_DENY to return a custom error.
Limit custom error values to errors expected on read(2)/write(2) and
open(2) of regular files. This list could be extended in the future.
Userspace can test for legitimate values of FAN_DENY_ERRNO(errno) by
writing a response to an fanotify group fd with a value of FAN_NOFD in
the fd field of the response.
The change in fanotify_response is backward compatible, because errno is
written in the high 8 bits of the 32bit response field and old kernels
reject respose value with high bits set.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
fs/notify/fanotify/fanotify.c | 18 ++++++++++-----
fs/notify/fanotify/fanotify.h | 10 +++++++++
fs/notify/fanotify/fanotify_user.c | 36 +++++++++++++++++++++++++-----
include/linux/fanotify.h | 5 ++++-
include/uapi/linux/fanotify.h | 7 ++++++
5 files changed, 65 insertions(+), 11 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 4e8dce39fa8f..1cbf41b34080 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -224,7 +224,8 @@ static int fanotify_get_response(struct fsnotify_group *group,
struct fanotify_perm_event *event,
struct fsnotify_iter_info *iter_info)
{
- int ret;
+ int ret, errno;
+ u32 decision;
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
@@ -257,20 +258,27 @@ static int fanotify_get_response(struct fsnotify_group *group,
goto out;
}
+ decision = fanotify_get_response_decision(event->response);
/* userspace responded, convert to something usable */
- switch (event->response & FANOTIFY_RESPONSE_ACCESS) {
+ switch (decision & FANOTIFY_RESPONSE_ACCESS) {
case FAN_ALLOW:
ret = 0;
break;
case FAN_DENY:
+ /* Check custom errno from pre-content events */
+ errno = fanotify_get_response_errno(event->response);
+ if (errno) {
+ ret = -errno;
+ break;
+ }
+ fallthrough;
default:
ret = -EPERM;
}
/* Check if the response should be audited */
- if (event->response & FAN_AUDIT)
- audit_fanotify(event->response & ~FAN_AUDIT,
- &event->audit_rule);
+ if (decision & FAN_AUDIT)
+ audit_fanotify(decision & ~FAN_AUDIT, &event->audit_rule);
pr_debug("%s: group=%p event=%p about to return ret=%d\n", __func__,
group, event, ret);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 7f06355afa1f..d0722ef13138 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -528,3 +528,13 @@ static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
return mflags;
}
+
+static inline u32 fanotify_get_response_decision(u32 res)
+{
+ return res & (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
+}
+
+static inline int fanotify_get_response_errno(int res)
+{
+ return res >> FAN_ERRNO_SHIFT;
+}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index ed56fe6f5ec7..0a37f1c761aa 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -337,11 +337,13 @@ static int process_access_response(struct fsnotify_group *group,
struct fanotify_perm_event *event;
int fd = response_struct->fd;
u32 response = response_struct->response;
+ u32 decision = fanotify_get_response_decision(response);
+ int errno = fanotify_get_response_errno(response);
int ret = info_len;
struct fanotify_response_info_audit_rule friar;
- pr_debug("%s: group=%p fd=%d response=%u buf=%p size=%zu\n", __func__,
- group, fd, response, info, info_len);
+ pr_debug("%s: group=%p fd=%d response=%x errno=%d buf=%p size=%zu\n",
+ __func__, group, fd, response, errno, info, info_len);
/*
* make sure the response is valid, if invalid we do nothing and either
* userspace can send a valid response or we will clean it up after the
@@ -350,18 +352,42 @@ static int process_access_response(struct fsnotify_group *group,
if (response & ~FANOTIFY_RESPONSE_VALID_MASK)
return -EINVAL;
- switch (response & FANOTIFY_RESPONSE_ACCESS) {
+ switch (decision & FANOTIFY_RESPONSE_ACCESS) {
case FAN_ALLOW:
+ if (errno)
+ return -EINVAL;
+ break;
case FAN_DENY:
+ /* Custom errno is supported only for pre-content groups */
+ if (errno && group->priority != FSNOTIFY_PRIO_PRE_CONTENT)
+ return -EINVAL;
+
+ /*
+ * Limit errno to values expected on open(2)/read(2)/write(2)
+ * of regular files.
+ */
+ switch (errno) {
+ case 0:
+ case EIO:
+ case EPERM:
+ case EBUSY:
+ case ETXTBSY:
+ case EAGAIN:
+ case ENOSPC:
+ case EDQUOT:
+ break;
+ default:
+ return -EINVAL;
+ }
break;
default:
return -EINVAL;
}
- if ((response & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
+ if ((decision & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
return -EINVAL;
- if (response & FAN_INFO) {
+ if (decision & FAN_INFO) {
ret = process_access_response_info(info, info_len, &friar);
if (ret < 0)
return ret;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index ae6cb2688d52..547514542669 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -132,7 +132,10 @@
/* These masks check for invalid bits in permission responses. */
#define FANOTIFY_RESPONSE_ACCESS (FAN_ALLOW | FAN_DENY)
#define FANOTIFY_RESPONSE_FLAGS (FAN_AUDIT | FAN_INFO)
-#define FANOTIFY_RESPONSE_VALID_MASK (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS)
+#define FANOTIFY_RESPONSE_ERRNO (FAN_ERRNO_MASK << FAN_ERRNO_SHIFT)
+#define FANOTIFY_RESPONSE_VALID_MASK \
+ (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS | \
+ FANOTIFY_RESPONSE_ERRNO)
/* Do not use these old uapi constants internally */
#undef FAN_ALL_CLASS_BITS
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index cc28dce5f744..7b746c5fcbd8 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -233,6 +233,13 @@ struct fanotify_response_info_audit_rule {
/* Legit userspace responses to a _PERM event */
#define FAN_ALLOW 0x01
#define FAN_DENY 0x02
+/* errno other than EPERM can specified in upper byte of deny response */
+#define FAN_ERRNO_BITS 8
+#define FAN_ERRNO_SHIFT (32 - FAN_ERRNO_BITS)
+#define FAN_ERRNO_MASK ((1 << FAN_ERRNO_BITS) - 1)
+#define FAN_DENY_ERRNO(err) \
+ (FAN_DENY | ((((__u32)(err)) & FAN_ERRNO_MASK) << FAN_ERRNO_SHIFT))
+
#define FAN_AUDIT 0x10 /* Bitmask to create audit record for result */
#define FAN_INFO 0x20 /* Bitmask to indicate additional information */
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 10/16] fanotify: add a helper to check for pre content events
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (8 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 09/16] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 11/16] fanotify: disable readahead if we have pre-content watches Josef Bacik
` (6 subsequent siblings)
16 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
We want to emit events during page fault, and calling into fanotify
could be expensive, so add a helper to allow us to skip calling into
fanotify from page fault. This will also be used to disable readahead
for content watched files which will be handled in a subsequent patch.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/notify/fsnotify.c | 12 ++++++++++++
include/linux/fsnotify_backend.h | 14 ++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 1ca4a8da7f29..cbfaa000f815 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -201,6 +201,18 @@ static inline bool fsnotify_object_watched(struct inode *inode, __u32 mnt_mask,
return mask & marks_mask & ALL_FSNOTIFY_EVENTS;
}
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ struct inode *inode = file_inode(file);
+ __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
+
+ return fsnotify_object_watched(inode, mnt_mask,
+ FSNOTIFY_PRE_CONTENT_EVENTS);
+}
+#endif
+
+
/*
* Notify this dentry's parent about a child's events with child name info
* if parent is watching or if inode/sb/mount are interested in events with
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 276320846bfd..b495a0676dd3 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -900,6 +900,15 @@ static inline void fsnotify_init_event(struct fsnotify_event *event)
INIT_LIST_HEAD(&event->list);
}
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+bool fsnotify_file_has_pre_content_watches(struct file *file);
+#else
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ return false;
+}
+#endif /* CONFIG_FANOTIFY_ACCESS_PERMISSIONS */
+
#else
static inline int fsnotify(__u32 mask, const void *data, int data_type,
@@ -938,6 +947,11 @@ static inline u32 fsnotify_get_cookie(void)
static inline void fsnotify_unmount_inodes(struct super_block *sb)
{}
+static inline bool fsnotify_file_has_pre_content_watches(struct file *file)
+{
+ return false;
+}
+
#endif /* CONFIG_FSNOTIFY */
#endif /* __KERNEL __ */
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 11/16] fanotify: disable readahead if we have pre-content watches
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (9 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 10/16] fanotify: add a helper to check for pre content events Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-29 10:48 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 12/16] mm: don't allow huge faults for files with pre content watches Josef Bacik
` (5 subsequent siblings)
16 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
With page faults we can trigger readahead on the file, and then
subsequent faults can find these pages and insert them into the file
without emitting an fanotify event. To avoid this case, disable
readahead if we have pre-content watches on the file. This way we are
guaranteed to get an event for every range we attempt to access on a
pre-content watched file.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
mm/filemap.c | 12 ++++++++++++
mm/readahead.c | 13 +++++++++++++
2 files changed, 25 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index ca8c8d889eef..8b1684b62177 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3122,6 +3122,14 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
unsigned long vm_flags = vmf->vma->vm_flags;
unsigned int mmap_miss;
+ /*
+ * If we have pre-content watches we need to disable readahead to make
+ * sure that we don't populate our mapping with 0 filled pages that we
+ * never emitted an event for.
+ */
+ if (fsnotify_file_has_pre_content_watches(file))
+ return fpin;
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Use the readahead code, even if readahead is disabled */
if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
@@ -3190,6 +3198,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
struct file *fpin = NULL;
unsigned int mmap_miss;
+ /* See comment in do_sync_mmap_readahead. */
+ if (fsnotify_file_has_pre_content_watches(file))
+ return fpin;
+
/* If we don't want any read-ahead, don't bother */
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
return fpin;
diff --git a/mm/readahead.c b/mm/readahead.c
index 817b2a352d78..bc068d9218e3 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -128,6 +128,7 @@
#include <linux/blk-cgroup.h>
#include <linux/fadvise.h>
#include <linux/sched/mm.h>
+#include <linux/fsnotify.h>
#include "internal.h"
@@ -674,6 +675,14 @@ void page_cache_sync_ra(struct readahead_control *ractl,
{
bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
+ /*
+ * If we have pre-content watches we need to disable readahead to make
+ * sure that we don't find 0 filled pages in cache that we never emitted
+ * events for.
+ */
+ if (ractl->file && fsnotify_file_has_pre_content_watches(ractl->file))
+ return;
+
/*
* Even if readahead is disabled, issue this request as readahead
* as we'll need it to satisfy the requested range. The forced
@@ -704,6 +713,10 @@ void page_cache_async_ra(struct readahead_control *ractl,
if (!ractl->ra->ra_pages)
return;
+ /* See the comment in page_cache_sync_ra. */
+ if (ractl->file && fsnotify_file_has_pre_content_watches(ractl->file))
+ return;
+
/*
* Same bit is used for PG_readahead and PG_reclaim.
*/
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 12/16] mm: don't allow huge faults for files with pre content watches
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (10 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 11/16] fanotify: disable readahead if we have pre-content watches Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-29 10:51 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 13/16] fsnotify: generate pre-content permission event on page fault Josef Bacik
` (4 subsequent siblings)
16 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
There's nothing stopping us from supporting this, we could simply pass
the order into the helper and emit the proper length. However currently
there's no tests to validate this works properly, so disable it until
there's a desire to support this along with the appropriate tests.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
mm/memory.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index d10e616d7389..3010bcc5e4f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/fsnotify.h>
#include <trace/events/kmem.h>
@@ -5252,8 +5253,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
if (vma_is_anonymous(vma))
return do_huge_pmd_anonymous_page(vmf);
+ /*
+ * Currently we just emit PAGE_SIZE for our fault events, so don't allow
+ * a huge fault if we have a pre content watch on this file. This would
+ * be trivial to support, but there would need to be tests to ensure
+ * this works properly and those don't exist currently.
+ */
+ if (file && fsnotify_file_has_pre_content_watches(file))
+ return VM_FAULT_FALLBACK;
if (vma->vm_ops->huge_fault)
return vma->vm_ops->huge_fault(vmf, PMD_ORDER);
return VM_FAULT_FALLBACK;
@@ -5263,6 +5273,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
vm_fault_t ret;
@@ -5277,6 +5288,9 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
}
if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+ /* See comment in create_huge_pmd. */
+ if (file && fsnotify_file_has_pre_content_watches(file))
+ goto split;
if (vma->vm_ops->huge_fault) {
ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER);
if (!(ret & VM_FAULT_FALLBACK))
@@ -5296,9 +5310,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vma))
return VM_FAULT_FALLBACK;
+ /* See comment in create_huge_pmd. */
+ if (file && fsnotify_file_has_pre_content_watches(file))
+ return VM_FAULT_FALLBACK;
if (vma->vm_ops->huge_fault)
return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -5310,12 +5328,16 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
vm_fault_t ret;
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vma))
goto split;
if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+ /* See comment in create_huge_pmd. */
+ if (file && fsnotify_file_has_pre_content_watches(file))
+ goto split;
if (vma->vm_ops->huge_fault) {
ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER);
if (!(ret & VM_FAULT_FALLBACK))
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 13/16] fsnotify: generate pre-content permission event on page fault
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (11 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 12/16] mm: don't allow huge faults for files with pre content watches Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-29 11:07 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault Josef Bacik
` (3 subsequent siblings)
16 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
on the faulting method.
This pre-content event is meant to be used by hierarchical storage
managers that want to fill in the file content on first read access.
Export a simple helper that file systems that have their own ->fault()
will use, and have a more complicated helper to be do fancy things with
in filemap_fault.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
include/linux/mm.h | 1 +
mm/filemap.c | 116 ++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 110 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ab3d78116043..3e190f0a0997 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3503,6 +3503,7 @@ extern vm_fault_t filemap_fault(struct vm_fault *vmf);
extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
+extern vm_fault_t filemap_maybe_emit_fsnotify_event(struct vm_fault *vmf);
extern unsigned long stack_guard_gap;
/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
diff --git a/mm/filemap.c b/mm/filemap.c
index 8b1684b62177..50e88e47dff3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -46,6 +46,7 @@
#include <linux/pipe_fs_i.h>
#include <linux/splice.h>
#include <linux/rcupdate_wait.h>
+#include <linux/fsnotify.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -3112,13 +3113,13 @@ static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio,
* that. If we didn't pin a file then we return NULL. The file that is
* returned needs to be fput()'ed when we're done with it.
*/
-static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
+static struct file *do_sync_mmap_readahead(struct vm_fault *vmf,
+ struct file *fpin)
{
struct file *file = vmf->vma->vm_file;
struct file_ra_state *ra = &file->f_ra;
struct address_space *mapping = file->f_mapping;
DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
- struct file *fpin = NULL;
unsigned long vm_flags = vmf->vma->vm_flags;
unsigned int mmap_miss;
@@ -3190,12 +3191,12 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
* was pinned if we have to drop the mmap_lock in order to do IO.
*/
static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
- struct folio *folio)
+ struct folio *folio,
+ struct file *fpin)
{
struct file *file = vmf->vma->vm_file;
struct file_ra_state *ra = &file->f_ra;
DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
- struct file *fpin = NULL;
unsigned int mmap_miss;
/* See comment in do_sync_mmap_readahead. */
@@ -3260,6 +3261,93 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
return ret;
}
+/*
+ * If we have pre-content watches on this file we will need to emit an event for
+ * this range. We will handle dropping the lock and emitting the event.
+ *
+ * If FAULT_FLAG_RETRY_NOWAIT is set then we'll return VM_FAULT_RETRY.
+ *
+ * If no event was emitted then *fpin will be NULL and we will return 0.
+ *
+ * If any error occurred we will return VM_FAULT_SIGBUS, *fpin could still be
+ * set and will need to have fput() called on it.
+ *
+ * If we emitted the event then we will return 0 and *fpin will be set, this
+ * must have fput() called on it, and the caller must call VM_FAULT_RETRY after
+ * any other operations it does in order to re-fault the page and make sure the
+ * appropriate locking is maintained.
+ *
+ * Return: the appropriate vm_fault_t return code, 0 on success.
+ */
+static vm_fault_t __filemap_maybe_emit_fsnotify_event(struct vm_fault *vmf,
+ struct file **fpin)
+{
+ struct file *file = vmf->vma->vm_file;
+ loff_t pos = vmf->pgoff << PAGE_SHIFT;
+ int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
+ int ret;
+
+ /*
+ * We already did this and now we're retrying with everything locked,
+ * don't emit the event and continue.
+ */
+ if (vmf->flags & FAULT_FLAG_TRIED)
+ return 0;
+
+ /* No watches, return NULL. */
+ if (!fsnotify_file_has_pre_content_watches(file))
+ return 0;
+
+ /* We are NOWAIT, we can't wait, just return EAGAIN. */
+ if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+ return VM_FAULT_RETRY;
+
+ /*
+ * If this fails then we're not allowed to drop the fault lock, return a
+ * SIGBUS so we don't errantly populate pagecache with bogus data for
+ * this file.
+ */
+ *fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
+ if (*fpin == NULL)
+ return VM_FAULT_SIGBUS;
+
+ /*
+ * We can't fput(*fpin) at this point because we could have been passed
+ * in fpin from a previous call.
+ */
+ ret = fsnotify_file_area_perm(*fpin, mask, &pos, PAGE_SIZE);
+ if (ret)
+ return VM_FAULT_SIGBUS;
+
+ return 0;
+}
+
+/**
+ * filemap_maybe_emit_fsnotify_event - maybe emit a pre-content event.
+ * @vmf: struct vm_fault containing details of the fault.
+ *
+ * If we have a pre-content watch on this file we will emit an event for this
+ * range. If we return anything the fault caller should return immediately, we
+ * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
+ * fault again and then the fault handler will run the second time through.
+ *
+ * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
+ */
+vm_fault_t filemap_maybe_emit_fsnotify_event(struct vm_fault *vmf)
+{
+ struct file *fpin = NULL;
+ vm_fault_t ret;
+
+ ret = __filemap_maybe_emit_fsnotify_event(vmf, &fpin);
+ if (fpin) {
+ fput(fpin);
+ if (!ret)
+ ret = VM_FAULT_RETRY;
+ }
+ return ret;
+}
+EXPORT_SYMBOL_GPL(filemap_maybe_emit_fsnotify_event);
+
/**
* filemap_fault - read in file data for page fault handling
* @vmf: struct vm_fault containing details of the fault
@@ -3299,6 +3387,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
if (unlikely(index >= max_idx))
return VM_FAULT_SIGBUS;
+ /*
+ * If we have pre-content watchers then we need to generate events on
+ * page fault so that we can populate any data before the fault.
+ */
+ ret = __filemap_maybe_emit_fsnotify_event(vmf, &fpin);
+ if (unlikely(ret)) {
+ if (fpin)
+ fput(fpin);
+ return ret;
+ }
+
/*
* Do we have something in the page cache already?
*/
@@ -3309,21 +3408,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* the lock.
*/
if (!(vmf->flags & FAULT_FLAG_TRIED))
- fpin = do_async_mmap_readahead(vmf, folio);
+ fpin = do_async_mmap_readahead(vmf, folio, fpin);
if (unlikely(!folio_test_uptodate(folio))) {
filemap_invalidate_lock_shared(mapping);
mapping_locked = true;
}
} else {
ret = filemap_fault_recheck_pte_none(vmf);
- if (unlikely(ret))
+ if (unlikely(ret)) {
+ if (fpin)
+ goto out_retry;
return ret;
+ }
/* No page in the page cache at all */
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
- fpin = do_sync_mmap_readahead(vmf);
+ fpin = do_sync_mmap_readahead(vmf, fpin);
retry_find:
/*
* See comment in filemap_create_folio() why we need
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (12 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 13/16] fsnotify: generate pre-content permission event on page fault Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-29 11:10 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 15/16] gfs2: " Josef Bacik
` (2 subsequent siblings)
16 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
bcachefs has its own locking around filemap_fault, so we have to make
sure we do the fsnotify hook before the locking. Add the check to emit
the event before the locking and return VM_FAULT_RETRY to retrigger the
fault once the event has been emitted.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/bcachefs/fs-io-pagecache.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/bcachefs/fs-io-pagecache.c b/fs/bcachefs/fs-io-pagecache.c
index a9cc5cad9cc9..1fa1f1ac48c8 100644
--- a/fs/bcachefs/fs-io-pagecache.c
+++ b/fs/bcachefs/fs-io-pagecache.c
@@ -570,6 +570,10 @@ vm_fault_t bch2_page_fault(struct vm_fault *vmf)
if (fdm == mapping)
return VM_FAULT_SIGBUS;
+ ret = filemap_maybe_emit_fsnotify_event(vmf);
+ if (unlikely(ret))
+ return ret;
+
/* Lock ordering: */
if (fdm > mapping) {
struct bch_inode_info *fdm_host = to_bch_ei(fdm->host);
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 15/16] gfs2: add pre-content fsnotify hook to fault
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (13 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-29 11:15 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults Josef Bacik
2024-08-29 21:41 ` [PATCH v4 00/16] fanotify: add pre-content hooks Darrick J. Wong
16 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
gfs2 takes the glock before calling into filemap fault, so add the
fsnotify hook for ->fault before we take the glock in order to avoid any
possible deadlock with the HSM.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/gfs2/file.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 08982937b5df..d4af70d765e0 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -556,6 +556,10 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
vm_fault_t ret;
int err;
+ ret = filemap_maybe_emit_fsnotify_event(vmf);
+ if (unlikely(ret))
+ return ret;
+
gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
err = gfs2_glock_nq(&gh);
if (err) {
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (14 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 15/16] gfs2: " Josef Bacik
@ 2024-08-14 21:25 ` Josef Bacik
2024-08-29 11:17 ` Jan Kara
2024-08-29 21:41 ` [PATCH v4 00/16] fanotify: add pre-content hooks Darrick J. Wong
16 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-14 21:25 UTC (permalink / raw)
To: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
xfs has it's own handling for write faults, so we need to add the
pre-content fsnotify hook for this case. Reads go through filemap_fault
so they're handled properly there.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/xfs/xfs_file.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4cdc54dc9686..e61c4c389d7d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1283,6 +1283,10 @@ xfs_write_fault(
unsigned int lock_mode = XFS_MMAPLOCK_SHARED;
vm_fault_t ret;
+ ret = filemap_maybe_emit_fsnotify_event(vmf);
+ if (unlikely(ret))
+ return ret;
+
sb_start_pagefault(inode->i_sb);
file_update_time(vmf->vma->vm_file);
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v4 08/16] fanotify: report file range info with pre-content events
2024-08-14 21:25 ` [PATCH v4 08/16] fanotify: report file range info with pre-content events Josef Bacik
@ 2024-08-29 10:13 ` Jan Kara
0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2024-08-29 10:13 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
On Wed 14-08-24 17:25:26, Josef Bacik wrote:
> From: Amir Goldstein <amir73il@gmail.com>
>
> With group class FAN_CLASS_PRE_CONTENT, report offset and length info
> along with FAN_PRE_ACCESS and FAN_PRE_MODIFY permission events.
>
> This information is meant to be used by hierarchical storage managers
> that want to fill partial content of files on first access to range.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
...
> @@ -191,6 +192,12 @@ struct fanotify_event_info_error {
> __u32 error_count;
> };
>
> +struct fanotify_event_info_range {
> + struct fanotify_event_info_header hdr;
There will be 32-bits of padding here and for UAPI, I prefer to keep that
explicit by adding:
__u32 pad;
here. I can fix it on commit...
> + __u64 offset;
> + __u64 count;
> +};
> +
> /*
> * User space may need to record additional information about its decision.
> * The extra information type records what kind of information is included.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 09/16] fanotify: allow to set errno in FAN_DENY permission response
2024-08-14 21:25 ` [PATCH v4 09/16] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
@ 2024-08-29 10:25 ` Jan Kara
0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2024-08-29 10:25 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
On Wed 14-08-24 17:25:27, Josef Bacik wrote:
> From: Amir Goldstein <amir73il@gmail.com>
>
> With FAN_DENY response, user trying to perform the filesystem operation
> gets an error with errno set to EPERM.
>
> It is useful for hierarchical storage management (HSM) service to be able
> to deny access for reasons more diverse than EPERM, for example EAGAIN,
> if HSM could retry the operation later.
>
> Allow fanotify groups with priority FAN_CLASSS_PRE_CONTENT to responsd
> to permission events with the response value FAN_DENY_ERRNO(errno),
> instead of FAN_DENY to return a custom error.
>
> Limit custom error values to errors expected on read(2)/write(2) and
> open(2) of regular files. This list could be extended in the future.
> Userspace can test for legitimate values of FAN_DENY_ERRNO(errno) by
> writing a response to an fanotify group fd with a value of FAN_NOFD in
> the fd field of the response.
>
> The change in fanotify_response is backward compatible, because errno is
> written in the high 8 bits of the 32bit response field and old kernels
> reject respose value with high bits set.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
...
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index 7f06355afa1f..d0722ef13138 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -528,3 +528,13 @@ static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
>
> return mflags;
> }
> +
> +static inline u32 fanotify_get_response_decision(u32 res)
> +{
> + return res & (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS);
> +}
I'm not convinced this helper really helps readability. Probably I'll drop
it on commit...
> +
> +static inline int fanotify_get_response_errno(int res)
'res' should be u32 here. Otherwise C compiler would do arithmetic shifting
and returned errno would be wrong. I can fix this on commit.
> +{
> + return res >> FAN_ERRNO_SHIFT;
> +}
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 11/16] fanotify: disable readahead if we have pre-content watches
2024-08-14 21:25 ` [PATCH v4 11/16] fanotify: disable readahead if we have pre-content watches Josef Bacik
@ 2024-08-29 10:48 ` Jan Kara
2024-08-29 12:44 ` Josef Bacik
0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2024-08-29 10:48 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
On Wed 14-08-24 17:25:29, Josef Bacik wrote:
> With page faults we can trigger readahead on the file, and then
> subsequent faults can find these pages and insert them into the file
> without emitting an fanotify event. To avoid this case, disable
> readahead if we have pre-content watches on the file. This way we are
> guaranteed to get an event for every range we attempt to access on a
> pre-content watched file.
>
> Reviewed-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
...
> @@ -674,6 +675,14 @@ void page_cache_sync_ra(struct readahead_control *ractl,
> {
> bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
>
> + /*
> + * If we have pre-content watches we need to disable readahead to make
> + * sure that we don't find 0 filled pages in cache that we never emitted
> + * events for.
> + */
> + if (ractl->file && fsnotify_file_has_pre_content_watches(ractl->file))
> + return;
> +
There are callers which don't pass struct file to readahead (either to
page_cache_sync_ra() or page_cache_async_ra()). Luckily these are very few
- cramfs for a block device (we don't care) and btrfs from code paths like
send-receive or defrag. Now if you tell me you're fine breaking these
corner cases for btrfs, I'll take your word for it but it looks like a
nasty trap to me. Now doing things like defrag or send-receive on offline
files on HSM managed filesystem doesn't look like a terribly good idea
anyway so perhaps we just want btrfs to check and refuse such things?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 12/16] mm: don't allow huge faults for files with pre content watches
2024-08-14 21:25 ` [PATCH v4 12/16] mm: don't allow huge faults for files with pre content watches Josef Bacik
@ 2024-08-29 10:51 ` Jan Kara
0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2024-08-29 10:51 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs, linux-mm
On Wed 14-08-24 17:25:30, Josef Bacik wrote:
> There's nothing stopping us from supporting this, we could simply pass
> the order into the helper and emit the proper length. However currently
> there's no tests to validate this works properly, so disable it until
> there's a desire to support this along with the appropriate tests.
>
> Reviewed-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Looks good to me. I don't expect this to be controversial but let's CC MM
guys for awareness...
Honza
> ---
> mm/memory.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d10e616d7389..3010bcc5e4f9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -78,6 +78,7 @@
> #include <linux/ptrace.h>
> #include <linux/vmalloc.h>
> #include <linux/sched/sysctl.h>
> +#include <linux/fsnotify.h>
>
> #include <trace/events/kmem.h>
>
> @@ -5252,8 +5253,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> if (vma_is_anonymous(vma))
> return do_huge_pmd_anonymous_page(vmf);
> + /*
> + * Currently we just emit PAGE_SIZE for our fault events, so don't allow
> + * a huge fault if we have a pre content watch on this file. This would
> + * be trivial to support, but there would need to be tests to ensure
> + * this works properly and those don't exist currently.
> + */
> + if (file && fsnotify_file_has_pre_content_watches(file))
> + return VM_FAULT_FALLBACK;
> if (vma->vm_ops->huge_fault)
> return vma->vm_ops->huge_fault(vmf, PMD_ORDER);
> return VM_FAULT_FALLBACK;
> @@ -5263,6 +5273,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
> static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
> vm_fault_t ret;
>
> @@ -5277,6 +5288,9 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
> }
>
> if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> + /* See comment in create_huge_pmd. */
> + if (file && fsnotify_file_has_pre_content_watches(file))
> + goto split;
> if (vma->vm_ops->huge_fault) {
> ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER);
> if (!(ret & VM_FAULT_FALLBACK))
> @@ -5296,9 +5310,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> /* No support for anonymous transparent PUD pages yet */
> if (vma_is_anonymous(vma))
> return VM_FAULT_FALLBACK;
> + /* See comment in create_huge_pmd. */
> + if (file && fsnotify_file_has_pre_content_watches(file))
> + return VM_FAULT_FALLBACK;
> if (vma->vm_ops->huge_fault)
> return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> @@ -5310,12 +5328,16 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> vm_fault_t ret;
>
> /* No support for anonymous transparent PUD pages yet */
> if (vma_is_anonymous(vma))
> goto split;
> if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> + /* See comment in create_huge_pmd. */
> + if (file && fsnotify_file_has_pre_content_watches(file))
> + goto split;
> if (vma->vm_ops->huge_fault) {
> ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER);
> if (!(ret & VM_FAULT_FALLBACK))
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 13/16] fsnotify: generate pre-content permission event on page fault
2024-08-14 21:25 ` [PATCH v4 13/16] fsnotify: generate pre-content permission event on page fault Josef Bacik
@ 2024-08-29 11:07 ` Jan Kara
0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2024-08-29 11:07 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs, linux-mm
On Wed 14-08-24 17:25:31, Josef Bacik wrote:
> FS_PRE_ACCESS or FS_PRE_MODIFY will be generated on page fault depending
> on the faulting method.
>
> This pre-content event is meant to be used by hierarchical storage
> managers that want to fill in the file content on first read access.
>
> Export a simple helper that file systems that have their own ->fault()
> will use, and have a more complicated helper to be do fancy things with
> in filemap_fault.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Looks good to me. Just let's CC MM guys for awareness about changes to
filemap_fault().
Honza
> ---
> include/linux/mm.h | 1 +
> mm/filemap.c | 116 ++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 110 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ab3d78116043..3e190f0a0997 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3503,6 +3503,7 @@ extern vm_fault_t filemap_fault(struct vm_fault *vmf);
> extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> pgoff_t start_pgoff, pgoff_t end_pgoff);
> extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
> +extern vm_fault_t filemap_maybe_emit_fsnotify_event(struct vm_fault *vmf);
>
> extern unsigned long stack_guard_gap;
> /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8b1684b62177..50e88e47dff3 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -46,6 +46,7 @@
> #include <linux/pipe_fs_i.h>
> #include <linux/splice.h>
> #include <linux/rcupdate_wait.h>
> +#include <linux/fsnotify.h>
> #include <asm/pgalloc.h>
> #include <asm/tlbflush.h>
> #include "internal.h"
> @@ -3112,13 +3113,13 @@ static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio,
> * that. If we didn't pin a file then we return NULL. The file that is
> * returned needs to be fput()'ed when we're done with it.
> */
> -static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> +static struct file *do_sync_mmap_readahead(struct vm_fault *vmf,
> + struct file *fpin)
> {
> struct file *file = vmf->vma->vm_file;
> struct file_ra_state *ra = &file->f_ra;
> struct address_space *mapping = file->f_mapping;
> DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
> - struct file *fpin = NULL;
> unsigned long vm_flags = vmf->vma->vm_flags;
> unsigned int mmap_miss;
>
> @@ -3190,12 +3191,12 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> * was pinned if we have to drop the mmap_lock in order to do IO.
> */
> static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> - struct folio *folio)
> + struct folio *folio,
> + struct file *fpin)
> {
> struct file *file = vmf->vma->vm_file;
> struct file_ra_state *ra = &file->f_ra;
> DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
> - struct file *fpin = NULL;
> unsigned int mmap_miss;
>
> /* See comment in do_sync_mmap_readahead. */
> @@ -3260,6 +3261,93 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
> return ret;
> }
>
> +/*
> + * If we have pre-content watches on this file we will need to emit an event for
> + * this range. We will handle dropping the lock and emitting the event.
> + *
> + * If FAULT_FLAG_RETRY_NOWAIT is set then we'll return VM_FAULT_RETRY.
> + *
> + * If no event was emitted then *fpin will be NULL and we will return 0.
> + *
> + * If any error occurred we will return VM_FAULT_SIGBUS, *fpin could still be
> + * set and will need to have fput() called on it.
> + *
> + * If we emitted the event then we will return 0 and *fpin will be set, this
> + * must have fput() called on it, and the caller must call VM_FAULT_RETRY after
> + * any other operations it does in order to re-fault the page and make sure the
> + * appropriate locking is maintained.
> + *
> + * Return: the appropriate vm_fault_t return code, 0 on success.
> + */
> +static vm_fault_t __filemap_maybe_emit_fsnotify_event(struct vm_fault *vmf,
> + struct file **fpin)
> +{
> + struct file *file = vmf->vma->vm_file;
> + loff_t pos = vmf->pgoff << PAGE_SHIFT;
> + int mask = (vmf->flags & FAULT_FLAG_WRITE) ? MAY_WRITE : MAY_ACCESS;
> + int ret;
> +
> + /*
> + * We already did this and now we're retrying with everything locked,
> + * don't emit the event and continue.
> + */
> + if (vmf->flags & FAULT_FLAG_TRIED)
> + return 0;
> +
> + /* No watches, return NULL. */
> + if (!fsnotify_file_has_pre_content_watches(file))
> + return 0;
> +
> + /* We are NOWAIT, we can't wait, just return EAGAIN. */
> + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> + return VM_FAULT_RETRY;
> +
> + /*
> + * If this fails then we're not allowed to drop the fault lock, return a
> + * SIGBUS so we don't errantly populate pagecache with bogus data for
> + * this file.
> + */
> + *fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
> + if (*fpin == NULL)
> + return VM_FAULT_SIGBUS;
> +
> + /*
> + * We can't fput(*fpin) at this point because we could have been passed
> + * in fpin from a previous call.
> + */
> + ret = fsnotify_file_area_perm(*fpin, mask, &pos, PAGE_SIZE);
> + if (ret)
> + return VM_FAULT_SIGBUS;
> +
> + return 0;
> +}
> +
> +/**
> + * filemap_maybe_emit_fsnotify_event - maybe emit a pre-content event.
> + * @vmf: struct vm_fault containing details of the fault.
> + *
> + * If we have a pre-content watch on this file we will emit an event for this
> + * range. If we return anything the fault caller should return immediately, we
> + * will return VM_FAULT_RETRY if we had to emit an event, which will trigger the
> + * fault again and then the fault handler will run the second time through.
> + *
> + * Return: a bitwise-OR of %VM_FAULT_ codes, 0 if nothing happened.
> + */
> +vm_fault_t filemap_maybe_emit_fsnotify_event(struct vm_fault *vmf)
> +{
> + struct file *fpin = NULL;
> + vm_fault_t ret;
> +
> + ret = __filemap_maybe_emit_fsnotify_event(vmf, &fpin);
> + if (fpin) {
> + fput(fpin);
> + if (!ret)
> + ret = VM_FAULT_RETRY;
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(filemap_maybe_emit_fsnotify_event);
> +
> /**
> * filemap_fault - read in file data for page fault handling
> * @vmf: struct vm_fault containing details of the fault
> @@ -3299,6 +3387,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> if (unlikely(index >= max_idx))
> return VM_FAULT_SIGBUS;
>
> + /*
> + * If we have pre-content watchers then we need to generate events on
> + * page fault so that we can populate any data before the fault.
> + */
> + ret = __filemap_maybe_emit_fsnotify_event(vmf, &fpin);
> + if (unlikely(ret)) {
> + if (fpin)
> + fput(fpin);
> + return ret;
> + }
> +
> /*
> * Do we have something in the page cache already?
> */
> @@ -3309,21 +3408,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> * the lock.
> */
> if (!(vmf->flags & FAULT_FLAG_TRIED))
> - fpin = do_async_mmap_readahead(vmf, folio);
> + fpin = do_async_mmap_readahead(vmf, folio, fpin);
> if (unlikely(!folio_test_uptodate(folio))) {
> filemap_invalidate_lock_shared(mapping);
> mapping_locked = true;
> }
> } else {
> ret = filemap_fault_recheck_pte_none(vmf);
> - if (unlikely(ret))
> + if (unlikely(ret)) {
> + if (fpin)
> + goto out_retry;
> return ret;
> + }
>
> /* No page in the page cache at all */
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
> ret = VM_FAULT_MAJOR;
> - fpin = do_sync_mmap_readahead(vmf);
> + fpin = do_sync_mmap_readahead(vmf, fpin);
> retry_find:
> /*
> * See comment in filemap_create_folio() why we need
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault
2024-08-14 21:25 ` [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault Josef Bacik
@ 2024-08-29 11:10 ` Jan Kara
2024-08-29 11:26 ` Kent Overstreet
2024-08-29 12:25 ` Kent Overstreet
0 siblings, 2 replies; 40+ messages in thread
From: Jan Kara @ 2024-08-29 11:10 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs, Kent Overstreet
On Wed 14-08-24 17:25:32, Josef Bacik wrote:
> bcachefs has its own locking around filemap_fault, so we have to make
> sure we do the fsnotify hook before the locking. Add the check to emit
> the event before the locking and return VM_FAULT_RETRY to retrigger the
> fault once the event has been emitted.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Looks good to me. Would be nice to get ack from bcachefs guys. Kent?
Honza
> ---
> fs/bcachefs/fs-io-pagecache.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/bcachefs/fs-io-pagecache.c b/fs/bcachefs/fs-io-pagecache.c
> index a9cc5cad9cc9..1fa1f1ac48c8 100644
> --- a/fs/bcachefs/fs-io-pagecache.c
> +++ b/fs/bcachefs/fs-io-pagecache.c
> @@ -570,6 +570,10 @@ vm_fault_t bch2_page_fault(struct vm_fault *vmf)
> if (fdm == mapping)
> return VM_FAULT_SIGBUS;
>
> + ret = filemap_maybe_emit_fsnotify_event(vmf);
> + if (unlikely(ret))
> + return ret;
> +
> /* Lock ordering: */
> if (fdm > mapping) {
> struct bch_inode_info *fdm_host = to_bch_ei(fdm->host);
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 15/16] gfs2: add pre-content fsnotify hook to fault
2024-08-14 21:25 ` [PATCH v4 15/16] gfs2: " Josef Bacik
@ 2024-08-29 11:15 ` Jan Kara
2024-08-29 11:26 ` Amir Goldstein
2024-08-29 12:42 ` Josef Bacik
0 siblings, 2 replies; 40+ messages in thread
From: Jan Kara @ 2024-08-29 11:15 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs, Andreas Gruenbacher
On Wed 14-08-24 17:25:33, Josef Bacik wrote:
> gfs2 takes the glock before calling into filemap fault, so add the
> fsnotify hook for ->fault before we take the glock in order to avoid any
> possible deadlock with the HSM.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
The idea of interactions between GFS2 cluster locking and HSM gives me
creeps. But yes, this patch looks good to me. Would be nice to get ack from
GFS2 guys. Andreas?
Honza
> ---
> fs/gfs2/file.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 08982937b5df..d4af70d765e0 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -556,6 +556,10 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
> vm_fault_t ret;
> int err;
>
> + ret = filemap_maybe_emit_fsnotify_event(vmf);
> + if (unlikely(ret))
> + return ret;
> +
> gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
> err = gfs2_glock_nq(&gh);
> if (err) {
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults
2024-08-14 21:25 ` [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults Josef Bacik
@ 2024-08-29 11:17 ` Jan Kara
2024-08-30 23:28 ` Darrick J. Wong
0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2024-08-29 11:17 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs, Darrick J. Wong, Dave Chinner,
Christoph Hellwig
On Wed 14-08-24 17:25:34, Josef Bacik wrote:
> xfs has it's own handling for write faults, so we need to add the
> pre-content fsnotify hook for this case. Reads go through filemap_fault
> so they're handled properly there.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Looks good to me but it would be great to get explicit ack from some XFS
guy... Some selection CCed :)
Honza
> ---
> fs/xfs/xfs_file.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 4cdc54dc9686..e61c4c389d7d 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1283,6 +1283,10 @@ xfs_write_fault(
> unsigned int lock_mode = XFS_MMAPLOCK_SHARED;
> vm_fault_t ret;
>
> + ret = filemap_maybe_emit_fsnotify_event(vmf);
> + if (unlikely(ret))
> + return ret;
> +
> sb_start_pagefault(inode->i_sb);
> file_update_time(vmf->vma->vm_file);
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 15/16] gfs2: add pre-content fsnotify hook to fault
2024-08-29 11:15 ` Jan Kara
@ 2024-08-29 11:26 ` Amir Goldstein
2024-08-29 11:43 ` Jan Kara
2024-08-29 12:42 ` Josef Bacik
1 sibling, 1 reply; 40+ messages in thread
From: Amir Goldstein @ 2024-08-29 11:26 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, linux-xfs, gfs2,
linux-bcachefs, Andreas Gruenbacher
On Thu, Aug 29, 2024 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 14-08-24 17:25:33, Josef Bacik wrote:
> > gfs2 takes the glock before calling into filemap fault, so add the
> > fsnotify hook for ->fault before we take the glock in order to avoid any
> > possible deadlock with the HSM.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> The idea of interactions between GFS2 cluster locking and HSM gives me
> creeps. But yes, this patch looks good to me. Would be nice to get ack from
> GFS2 guys. Andreas?
If we are being honest, I think that the fact that HSM events require careful
handling in ->fault() and not to mention no documentation of this fact,
perhaps we should let HSM events be an opt-in file_system_type feature?
Additionally, we had to introduce FS_DISALLOW_NOTIFY_PERM
and restrict sb marks on SB_NOUSER, all because these fanotify
features did not require fs opt-in to begin with.
I think we would be repeating this mistake if we do not add
FS_ALLOW_HSM from the start.
After all, I cannot imagine HSM being used on anything but
the major disk filesystems.
Hmm?
Amir.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault
2024-08-29 11:10 ` Jan Kara
@ 2024-08-29 11:26 ` Kent Overstreet
2024-08-29 12:46 ` Josef Bacik
2024-08-29 12:25 ` Kent Overstreet
1 sibling, 1 reply; 40+ messages in thread
From: Kent Overstreet @ 2024-08-29 11:26 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, amir73il, brauner,
linux-xfs, gfs2, linux-bcachefs
On Thu, Aug 29, 2024 at 01:10:55PM GMT, Jan Kara wrote:
> On Wed 14-08-24 17:25:32, Josef Bacik wrote:
> > bcachefs has its own locking around filemap_fault, so we have to make
> > sure we do the fsnotify hook before the locking. Add the check to emit
> > the event before the locking and return VM_FAULT_RETRY to retrigger the
> > fault once the event has been emitted.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> Looks good to me. Would be nice to get ack from bcachefs guys. Kent?
I said I wanted the bcachefs side tested, and offered Josef CI access
for that - still waiting to hear from him.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 15/16] gfs2: add pre-content fsnotify hook to fault
2024-08-29 11:26 ` Amir Goldstein
@ 2024-08-29 11:43 ` Jan Kara
2024-08-29 12:49 ` Amir Goldstein
0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2024-08-29 11:43 UTC (permalink / raw)
To: Amir Goldstein
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, brauner,
linux-xfs, gfs2, linux-bcachefs, Andreas Gruenbacher
On Thu 29-08-24 13:26:17, Amir Goldstein wrote:
> On Thu, Aug 29, 2024 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 14-08-24 17:25:33, Josef Bacik wrote:
> > > gfs2 takes the glock before calling into filemap fault, so add the
> > > fsnotify hook for ->fault before we take the glock in order to avoid any
> > > possible deadlock with the HSM.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> >
> > The idea of interactions between GFS2 cluster locking and HSM gives me
> > creeps. But yes, this patch looks good to me. Would be nice to get ack from
> > GFS2 guys. Andreas?
>
> If we are being honest, I think that the fact that HSM events require careful
> handling in ->fault() and not to mention no documentation of this fact,
> perhaps we should let HSM events be an opt-in file_system_type feature?
>
> Additionally, we had to introduce FS_DISALLOW_NOTIFY_PERM
> and restrict sb marks on SB_NOUSER, all because these fanotify
> features did not require fs opt-in to begin with.
>
> I think we would be repeating this mistake if we do not add
> FS_ALLOW_HSM from the start.
>
> After all, I cannot imagine HSM being used on anything but
> the major disk filesystems.
>
> Hmm?
Yeah, I was considering this already when thinking about btrfs quirks with
readahead and various special filesystem ioctls and I agree that a need to be
careful with page faults is another good reason to make this a
per-filesystem opt in. Will you send a patch?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault
2024-08-29 11:10 ` Jan Kara
2024-08-29 11:26 ` Kent Overstreet
@ 2024-08-29 12:25 ` Kent Overstreet
1 sibling, 0 replies; 40+ messages in thread
From: Kent Overstreet @ 2024-08-29 12:25 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, amir73il, brauner,
linux-xfs, gfs2, linux-bcachefs
On Thu, Aug 29, 2024 at 01:10:55PM GMT, Jan Kara wrote:
> On Wed 14-08-24 17:25:32, Josef Bacik wrote:
> > bcachefs has its own locking around filemap_fault, so we have to make
> > sure we do the fsnotify hook before the locking. Add the check to emit
> > the event before the locking and return VM_FAULT_RETRY to retrigger the
> > fault once the event has been emitted.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> Looks good to me. Would be nice to get ack from bcachefs guys. Kent?
btw, happy to give you a CI account as well: just need username and ssh
pubkey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 15/16] gfs2: add pre-content fsnotify hook to fault
2024-08-29 11:15 ` Jan Kara
2024-08-29 11:26 ` Amir Goldstein
@ 2024-08-29 12:42 ` Josef Bacik
1 sibling, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-29 12:42 UTC (permalink / raw)
To: Jan Kara
Cc: kernel-team, linux-fsdevel, amir73il, brauner, linux-xfs, gfs2,
linux-bcachefs, Andreas Gruenbacher
On Thu, Aug 29, 2024 at 01:15:10PM +0200, Jan Kara wrote:
> On Wed 14-08-24 17:25:33, Josef Bacik wrote:
> > gfs2 takes the glock before calling into filemap fault, so add the
> > fsnotify hook for ->fault before we take the glock in order to avoid any
> > possible deadlock with the HSM.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> The idea of interactions between GFS2 cluster locking and HSM gives me
> creeps. But yes, this patch looks good to me. Would be nice to get ack from
> GFS2 guys. Andreas?
I did a lot of gfs2 work originally so I'm familiar with how it works, otherwise
I definitely would have just left it off.
That being said I'd also be fine with just gating it at an FS level. Thanks,
Josef
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 11/16] fanotify: disable readahead if we have pre-content watches
2024-08-29 10:48 ` Jan Kara
@ 2024-08-29 12:44 ` Josef Bacik
0 siblings, 0 replies; 40+ messages in thread
From: Josef Bacik @ 2024-08-29 12:44 UTC (permalink / raw)
To: Jan Kara
Cc: kernel-team, linux-fsdevel, amir73il, brauner, linux-xfs, gfs2,
linux-bcachefs
On Thu, Aug 29, 2024 at 12:48:05PM +0200, Jan Kara wrote:
> On Wed 14-08-24 17:25:29, Josef Bacik wrote:
> > With page faults we can trigger readahead on the file, and then
> > subsequent faults can find these pages and insert them into the file
> > without emitting an fanotify event. To avoid this case, disable
> > readahead if we have pre-content watches on the file. This way we are
> > guaranteed to get an event for every range we attempt to access on a
> > pre-content watched file.
> >
> > Reviewed-by: Christian Brauner <brauner@kernel.org>
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> ...
>
> > @@ -674,6 +675,14 @@ void page_cache_sync_ra(struct readahead_control *ractl,
> > {
> > bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
> >
> > + /*
> > + * If we have pre-content watches we need to disable readahead to make
> > + * sure that we don't find 0 filled pages in cache that we never emitted
> > + * events for.
> > + */
> > + if (ractl->file && fsnotify_file_has_pre_content_watches(ractl->file))
> > + return;
> > +
>
> There are callers which don't pass struct file to readahead (either to
> page_cache_sync_ra() or page_cache_async_ra()). Luckily these are very few
> - cramfs for a block device (we don't care) and btrfs from code paths like
> send-receive or defrag. Now if you tell me you're fine breaking these
> corner cases for btrfs, I'll take your word for it but it looks like a
> nasty trap to me. Now doing things like defrag or send-receive on offline
> files on HSM managed filesystem doesn't look like a terribly good idea
> anyway so perhaps we just want btrfs to check and refuse such things?
>
We can't have HSM on a send subvolume because they have to be read only. I
hadn't thought of defrag, I'll respin and add a patch to disallow defrag on a
file that has content watches. Thanks,
Josef
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault
2024-08-29 11:26 ` Kent Overstreet
@ 2024-08-29 12:46 ` Josef Bacik
2024-08-29 12:55 ` Kent Overstreet
0 siblings, 1 reply; 40+ messages in thread
From: Josef Bacik @ 2024-08-29 12:46 UTC (permalink / raw)
To: Kent Overstreet
Cc: Jan Kara, kernel-team, linux-fsdevel, amir73il, brauner,
linux-xfs, gfs2, linux-bcachefs
On Thu, Aug 29, 2024 at 07:26:42AM -0400, Kent Overstreet wrote:
> On Thu, Aug 29, 2024 at 01:10:55PM GMT, Jan Kara wrote:
> > On Wed 14-08-24 17:25:32, Josef Bacik wrote:
> > > bcachefs has its own locking around filemap_fault, so we have to make
> > > sure we do the fsnotify hook before the locking. Add the check to emit
> > > the event before the locking and return VM_FAULT_RETRY to retrigger the
> > > fault once the event has been emitted.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> >
> > Looks good to me. Would be nice to get ack from bcachefs guys. Kent?
>
> I said I wanted the bcachefs side tested, and offered Josef CI access
> for that - still waiting to hear from him.
My bad I thought I had responded. I tested bcachefs, xfs, ext4, and btrfs with
my tests. I'll get those turned into fstests today/tomorrow. Thanks,
Josef
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 15/16] gfs2: add pre-content fsnotify hook to fault
2024-08-29 11:43 ` Jan Kara
@ 2024-08-29 12:49 ` Amir Goldstein
0 siblings, 0 replies; 40+ messages in thread
From: Amir Goldstein @ 2024-08-29 12:49 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, brauner, linux-xfs, gfs2,
linux-bcachefs, Andreas Gruenbacher
On Thu, Aug 29, 2024 at 1:43 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 29-08-24 13:26:17, Amir Goldstein wrote:
> > On Thu, Aug 29, 2024 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Wed 14-08-24 17:25:33, Josef Bacik wrote:
> > > > gfs2 takes the glock before calling into filemap fault, so add the
> > > > fsnotify hook for ->fault before we take the glock in order to avoid any
> > > > possible deadlock with the HSM.
> > > >
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > >
> > > The idea of interactions between GFS2 cluster locking and HSM gives me
> > > creeps. But yes, this patch looks good to me. Would be nice to get ack from
> > > GFS2 guys. Andreas?
> >
> > If we are being honest, I think that the fact that HSM events require careful
> > handling in ->fault() and not to mention no documentation of this fact,
> > perhaps we should let HSM events be an opt-in file_system_type feature?
> >
> > Additionally, we had to introduce FS_DISALLOW_NOTIFY_PERM
> > and restrict sb marks on SB_NOUSER, all because these fanotify
> > features did not require fs opt-in to begin with.
> >
> > I think we would be repeating this mistake if we do not add
> > FS_ALLOW_HSM from the start.
> >
> > After all, I cannot imagine HSM being used on anything but
> > the major disk filesystems.
> >
> > Hmm?
>
> Yeah, I was considering this already when thinking about btrfs quirks with
> readahead and various special filesystem ioctls and I agree that a need to be
> careful with page faults is another good reason to make this a
> per-filesystem opt in. Will you send a patch?
Huh! I am still struggling to keep my head above the water
coming back from a long vacation, so I think it is better if Josef takes
that as well.
But here is a trivial, untested, probably broken, space damaged patch,
that should be broken and squashed into other patches in the series
if this is any help at all...
Thanks,
Amir.
index c3c8b2ea80b6..07c3cf038221 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1672,6 +1672,11 @@ static int fanotify_events_supported(struct
fsnotify_group *group,
(mask & FAN_RENAME) ||
(flags & FAN_MARK_IGNORE);
+ /* Filesystems need to opt-into pre-content evnets (a.k.a HSM) */
+ if (mask & FANOTIFY_PRE_CONTENT_EVENTS &&
+ path->mnt->mnt_sb->s_type->fs_flags & FS_ALLOW_HSM)
+ return -EINVAL;
+
/*
* Some filesystems such as 'proc' acquire unusual locks when opening
* files. For them fanotify permission events have high chances of
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index cbfaa000f815..32690e5c4815 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -207,7 +207,8 @@ bool fsnotify_file_has_pre_content_watches(struct
file *file)
struct inode *inode = file_inode(file);
__u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask;
- return fsnotify_object_watched(inode, mnt_mask,
+ return inode->i_sb->s_type->fs_flags & FS_ALLOW_HSM &&
+ fsnotify_object_watched(inode, mnt_mask,
FSNOTIFY_PRE_CONTENT_EVENTS);
}
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fb0426f349fc..14468736d15b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2499,6 +2499,7 @@ struct file_system_type {
#define FS_USERNS_MOUNT 8 /* Can be mounted by
userns root */
#define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify
permission events */
#define FS_ALLOW_IDMAP 32 /* FS has been updated to
handle vfs idmappings. */
+#define FS_ALLOW_HSM 64 /* FS can handle fanotify
pre-content events. */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
during rename() internally. */
int (*init_fs_context)(struct fs_context *);
const struct fs_parameter_spec *parameters;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index ec1e88342442..b6845ab477d6 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -178,6 +178,7 @@ static inline int fsnotify_file_area_perm(struct
file *file, int perm_mask,
* if there are any pre-content event watchers on this sb.
*/
if ((!S_ISDIR(inode->i_mode) && !S_ISREG(inode->i_mode)) ||
+ !(inode->i_sb->s_type->fs_flags & FS_ALLOW_HSM) ||
!fsnotify_sb_has_priority_watchers(inode->i_sb,
FSNOTIFY_PRIO_PRE_CONTENT))
return 0;
---
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault
2024-08-29 12:46 ` Josef Bacik
@ 2024-08-29 12:55 ` Kent Overstreet
0 siblings, 0 replies; 40+ messages in thread
From: Kent Overstreet @ 2024-08-29 12:55 UTC (permalink / raw)
To: Josef Bacik
Cc: Jan Kara, kernel-team, linux-fsdevel, amir73il, brauner,
linux-xfs, gfs2, linux-bcachefs
On Thu, Aug 29, 2024 at 08:46:07AM GMT, Josef Bacik wrote:
> On Thu, Aug 29, 2024 at 07:26:42AM -0400, Kent Overstreet wrote:
> > On Thu, Aug 29, 2024 at 01:10:55PM GMT, Jan Kara wrote:
> > > On Wed 14-08-24 17:25:32, Josef Bacik wrote:
> > > > bcachefs has its own locking around filemap_fault, so we have to make
> > > > sure we do the fsnotify hook before the locking. Add the check to emit
> > > > the event before the locking and return VM_FAULT_RETRY to retrigger the
> > > > fault once the event has been emitted.
> > > >
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > >
> > > Looks good to me. Would be nice to get ack from bcachefs guys. Kent?
> >
> > I said I wanted the bcachefs side tested, and offered Josef CI access
> > for that - still waiting to hear from him.
>
> My bad I thought I had responded. I tested bcachefs, xfs, ext4, and btrfs with
> my tests. I'll get those turned into fstests today/tomorrow. Thanks,
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 00/16] fanotify: add pre-content hooks
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
` (15 preceding siblings ...)
2024-08-14 21:25 ` [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults Josef Bacik
@ 2024-08-29 21:41 ` Darrick J. Wong
2024-08-30 8:55 ` Amir Goldstein
16 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2024-08-29 21:41 UTC (permalink / raw)
To: Josef Bacik
Cc: kernel-team, linux-fsdevel, jack, amir73il, brauner, linux-xfs,
gfs2, linux-bcachefs
On Wed, Aug 14, 2024 at 05:25:18PM -0400, Josef Bacik wrote:
> v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
> v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
> v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
>
> v3->v4:
> - Trying to send a final verson Friday at 5pm before you go on vacation is a
> recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's
> review.
> - Reworked the file system helper so it's handling of fpin was a little less
> silly, per Chinner's suggestion.
> - Updated the return values to not or in VM_FAULT_RETRY, as we have a comment
> in filemap_fault that says if VM_FAULT_ERROR is set we won't have
> VM_FAULT_RETRY set.
>
> v2->v3:
> - Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
> emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
> Amir's suggestion.
> - Reworked the exported helper so the per-filesystem changes are much smaller,
> per Amir's suggestion.
> - Fixed the screwup for DAX writes per Chinner's suggestion.
> - Added Christian's reviewed-by's where appropriate.
>
> v1->v2:
> - reworked the page fault logic based on Jan's suggestion and turned it into a
> helper.
> - Added 3 patches per-fs where we need to call the fsnotify helper from their
> ->fault handlers.
> - Disabled readahead in the case that there's a pre-content watch in place.
> - Disabled huge faults when there's a pre-content watch in place (entirely
> because it's untested, theoretically it should be straightforward to do).
> - Updated the command numbers.
> - Addressed the random spelling/grammer mistakes that Jan pointed out.
> - Addressed the other random nits from Jan.
>
> --- Original email ---
>
> Hello,
>
> These are the patches for the bare bones pre-content fanotify support. The
> majority of this work is Amir's, my contribution to this has solely been around
> adding the page fault hooks, testing and validating everything. I'm sending it
> because Amir is traveling a bunch, and I touched it last so I'm going to take
> all the hate and he can take all the credit.
>
> There is a PoC that I've been using to validate this work, you can find the git
> repo here
>
> https://github.com/josefbacik/remote-fetch
>
> This consists of 3 different tools.
>
> 1. populate. This just creates all the stub files in the directory from the
> source directory. Just run ./populate ~/linux ~/hsm-linux and it'll
> recursively create all of the stub files and directories.
> 2. remote-fetch. This is the actual PoC, you just point it at the source and
> destination directory and then you can do whatever. ./remote-fetch ~/linux
> ~/hsm-linux.
> 3. mmap-validate. This was to validate the pagefault thing, this is likely what
> will be turned into the selftest with remote-fetch. It creates a file and
> then you can validate the file matches the right pattern with both normal
> reads and mmap. Normally I do something like
>
> ./mmap-validate create ~/src/foo
> ./populate ~/src ~/dst
> ./rmeote-fetch ~/src ~/dst
> ./mmap-validate validate ~/dst/foo
>
> I did a bunch of testing, I also got some performance numbers. I copied a
> kernel tree, and then did remote-fetch, and then make -j4
>
> Normal
> real 9m49.709s
> user 28m11.372s
> sys 4m57.304s
>
> HSM
> real 10m6.454s
> user 29m10.517s
> sys 5m2.617s
>
> So ~17 seconds more to build with HSM. I then did a make mrproper on both trees
> to see the size
>
> [root@fedora ~]# du -hs /src/linux
> 1.6G /src/linux
> [root@fedora ~]# du -hs dst
> 125M dst
>
> This mirrors the sort of savings we've seen in production.
>
> Meta has had these patches (minus the page fault patch) deployed in production
> for almost a year with our own utility for doing on-demand package fetching.
> The savings from this has been pretty significant.
>
> The page-fault hooks are necessary for the last thing we need, which is
> on-demand range fetching of executables. Some of our binaries are several gigs
> large, having the ability to remote fetch them on demand is a huge win for us
> not only with space savings, but with startup time of containers.
So... does this pre-content fetcher already work for regular reads and
writes, and now you're looking to wire up page faults? Or does it only
handle page faults? Judging from Amir's patches I'm guessing the
FAN_PRE_{ACCESS,MODIFY} events are new, so this only sends notifications
prior to read and write page faults? The XFS change looks reasonable to
me, but I'm left wondering "what does this shiny /do/?"
--D
> There will be tests for this going into LTP once we're satisfied with the
> patches and they're on their way upstream. Thanks,
>
> Josef
>
> Amir Goldstein (8):
> fsnotify: introduce pre-content permission event
> fsnotify: generate pre-content permission event on open
> fanotify: introduce FAN_PRE_ACCESS permission event
> fanotify: introduce FAN_PRE_MODIFY permission event
> fanotify: pass optional file access range in pre-content event
> fanotify: rename a misnamed constant
> fanotify: report file range info with pre-content events
> fanotify: allow to set errno in FAN_DENY permission response
>
> Josef Bacik (8):
> fanotify: don't skip extra event info if no info_mode is set
> fanotify: add a helper to check for pre content events
> fanotify: disable readahead if we have pre-content watches
> mm: don't allow huge faults for files with pre content watches
> fsnotify: generate pre-content permission event on page fault
> bcachefs: add pre-content fsnotify hook to fault
> gfs2: add pre-content fsnotify hook to fault
> xfs: add pre-content fsnotify hook for write faults
>
> fs/bcachefs/fs-io-pagecache.c | 4 +
> fs/gfs2/file.c | 4 +
> fs/namei.c | 9 ++
> fs/notify/fanotify/fanotify.c | 32 ++++++--
> fs/notify/fanotify/fanotify.h | 20 +++++
> fs/notify/fanotify/fanotify_user.c | 116 +++++++++++++++++++++-----
> fs/notify/fsnotify.c | 14 +++-
> fs/xfs/xfs_file.c | 4 +
> include/linux/fanotify.h | 20 +++--
> include/linux/fsnotify.h | 54 ++++++++++--
> include/linux/fsnotify_backend.h | 59 ++++++++++++-
> include/linux/mm.h | 1 +
> include/uapi/linux/fanotify.h | 17 ++++
> mm/filemap.c | 128 +++++++++++++++++++++++++++--
> mm/memory.c | 22 +++++
> mm/readahead.c | 13 +++
> security/selinux/hooks.c | 3 +-
> 17 files changed, 469 insertions(+), 51 deletions(-)
>
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 00/16] fanotify: add pre-content hooks
2024-08-29 21:41 ` [PATCH v4 00/16] fanotify: add pre-content hooks Darrick J. Wong
@ 2024-08-30 8:55 ` Amir Goldstein
2024-08-30 23:22 ` Darrick J. Wong
0 siblings, 1 reply; 40+ messages in thread
From: Amir Goldstein @ 2024-08-30 8:55 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner, linux-xfs,
gfs2, linux-bcachefs
On Thu, Aug 29, 2024 at 11:41 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Aug 14, 2024 at 05:25:18PM -0400, Josef Bacik wrote:
> > v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
> > v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
> > v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
> >
> > v3->v4:
> > - Trying to send a final verson Friday at 5pm before you go on vacation is a
> > recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's
> > review.
> > - Reworked the file system helper so it's handling of fpin was a little less
> > silly, per Chinner's suggestion.
> > - Updated the return values to not or in VM_FAULT_RETRY, as we have a comment
> > in filemap_fault that says if VM_FAULT_ERROR is set we won't have
> > VM_FAULT_RETRY set.
> >
> > v2->v3:
> > - Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
> > emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
> > Amir's suggestion.
> > - Reworked the exported helper so the per-filesystem changes are much smaller,
> > per Amir's suggestion.
> > - Fixed the screwup for DAX writes per Chinner's suggestion.
> > - Added Christian's reviewed-by's where appropriate.
> >
> > v1->v2:
> > - reworked the page fault logic based on Jan's suggestion and turned it into a
> > helper.
> > - Added 3 patches per-fs where we need to call the fsnotify helper from their
> > ->fault handlers.
> > - Disabled readahead in the case that there's a pre-content watch in place.
> > - Disabled huge faults when there's a pre-content watch in place (entirely
> > because it's untested, theoretically it should be straightforward to do).
> > - Updated the command numbers.
> > - Addressed the random spelling/grammer mistakes that Jan pointed out.
> > - Addressed the other random nits from Jan.
> >
> > --- Original email ---
> >
> > Hello,
> >
> > These are the patches for the bare bones pre-content fanotify support. The
> > majority of this work is Amir's, my contribution to this has solely been around
> > adding the page fault hooks, testing and validating everything. I'm sending it
> > because Amir is traveling a bunch, and I touched it last so I'm going to take
> > all the hate and he can take all the credit.
> >
> > There is a PoC that I've been using to validate this work, you can find the git
> > repo here
> >
> > https://github.com/josefbacik/remote-fetch
> >
> > This consists of 3 different tools.
> >
> > 1. populate. This just creates all the stub files in the directory from the
> > source directory. Just run ./populate ~/linux ~/hsm-linux and it'll
> > recursively create all of the stub files and directories.
> > 2. remote-fetch. This is the actual PoC, you just point it at the source and
> > destination directory and then you can do whatever. ./remote-fetch ~/linux
> > ~/hsm-linux.
> > 3. mmap-validate. This was to validate the pagefault thing, this is likely what
> > will be turned into the selftest with remote-fetch. It creates a file and
> > then you can validate the file matches the right pattern with both normal
> > reads and mmap. Normally I do something like
> >
> > ./mmap-validate create ~/src/foo
> > ./populate ~/src ~/dst
> > ./rmeote-fetch ~/src ~/dst
> > ./mmap-validate validate ~/dst/foo
> >
> > I did a bunch of testing, I also got some performance numbers. I copied a
> > kernel tree, and then did remote-fetch, and then make -j4
> >
> > Normal
> > real 9m49.709s
> > user 28m11.372s
> > sys 4m57.304s
> >
> > HSM
> > real 10m6.454s
> > user 29m10.517s
> > sys 5m2.617s
> >
> > So ~17 seconds more to build with HSM. I then did a make mrproper on both trees
> > to see the size
> >
> > [root@fedora ~]# du -hs /src/linux
> > 1.6G /src/linux
> > [root@fedora ~]# du -hs dst
> > 125M dst
> >
> > This mirrors the sort of savings we've seen in production.
> >
> > Meta has had these patches (minus the page fault patch) deployed in production
> > for almost a year with our own utility for doing on-demand package fetching.
> > The savings from this has been pretty significant.
> >
> > The page-fault hooks are necessary for the last thing we need, which is
> > on-demand range fetching of executables. Some of our binaries are several gigs
> > large, having the ability to remote fetch them on demand is a huge win for us
> > not only with space savings, but with startup time of containers.
>
> So... does this pre-content fetcher already work for regular reads and
> writes, and now you're looking to wire up page faults? Or does it only
> handle page faults? Judging from Amir's patches I'm guessing the
> FAN_PRE_{ACCESS,MODIFY} events are new, so this only sends notifications
> prior to read and write page faults? The XFS change looks reasonable to
> me, but I'm left wondering "what does this shiny /do/?"
>
I *think* I understand the confusion.
Let me try to sort it out.
This patch set collaboration aims to add the functionality of HSM
service by adding events FS_PRE_{ACCESS,MODIFY} prior to
read/write and page faults.
Maybe you are puzzled by not seeing any new read/write hooks?
This is because the HSM events utilize the existing fsnotify_file_*perm()
hooks that are in place for the legacy FS_ACCESS_PERM event.
So why is a new FS_PRE_ACCESS needed?
Let me first quote commit message of patch 2/16 [1]:
---
The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM,
but it meant for a different use case of filling file content before
access to a file range, so it has slightly different semantics.
Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, same as
we did for FS_OPEN_PERM/FS_OPEN_EXEC_PERM.
FS_PRE_MODIFY is a new permission event, with similar semantics as
FS_PRE_ACCESS, which is called before a file is modified.
FS_ACCESS_PERM is reported also on blockdev and pipes, but the new
pre-content events are only reported for regular files and dirs.
The pre-content events are meant to be used by hierarchical storage
managers that want to fill the content of files on first access.
---
And from my man page draft [2]:
---
FAN_PRE_ACCESS (since Linux 6.x)
Create an event before a read from a directory or a file range,
that provides an opportunity for the event listener to
modify the content of the object
before the reader is going to access the content in the
specified range.
An additional information record of type
FAN_EVENT_INFO_TYPE_RANGE is returned
for each event in the read buffer.
...
---
So the semantics of the two events is slightly different, but also the
meaning of "an opportunity for the event listener to modify the content".
FS_ACCESS_PERM already provided this opportunity on old kernels,
but prior to "Tidy up file permission hooks" series [3] in v6.8, writing
file content in FS_ACCESS_PERM event context was prone to deadlocks.
Therefore, an application using FS_ACCESS_PERM may be prone
for deadlocks, while an application using FAN_PRE_ACCESS should
be safe in that regard.
Thanks,
Amir.
[1] https://lore.kernel.org/all/a96217d84dfebb15582a04524dc9821ba3ea1406.1723670362.git.josef@toxicpanda.com/
[2] https://github.com/amir73il/man-pages/commits/fan_pre_path
[3] https://lore.kernel.org/all/20231122122715.2561213-1-amir73il@gmail.com/
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 00/16] fanotify: add pre-content hooks
2024-08-30 8:55 ` Amir Goldstein
@ 2024-08-30 23:22 ` Darrick J. Wong
0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2024-08-30 23:22 UTC (permalink / raw)
To: Amir Goldstein
Cc: Josef Bacik, kernel-team, linux-fsdevel, jack, brauner, linux-xfs,
gfs2, linux-bcachefs
On Fri, Aug 30, 2024 at 10:55:10AM +0200, Amir Goldstein wrote:
> On Thu, Aug 29, 2024 at 11:41 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Aug 14, 2024 at 05:25:18PM -0400, Josef Bacik wrote:
> > > v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
> > > v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
> > > v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
> > >
> > > v3->v4:
> > > - Trying to send a final verson Friday at 5pm before you go on vacation is a
> > > recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's
> > > review.
> > > - Reworked the file system helper so it's handling of fpin was a little less
> > > silly, per Chinner's suggestion.
> > > - Updated the return values to not or in VM_FAULT_RETRY, as we have a comment
> > > in filemap_fault that says if VM_FAULT_ERROR is set we won't have
> > > VM_FAULT_RETRY set.
> > >
> > > v2->v3:
> > > - Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
> > > emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
> > > Amir's suggestion.
> > > - Reworked the exported helper so the per-filesystem changes are much smaller,
> > > per Amir's suggestion.
> > > - Fixed the screwup for DAX writes per Chinner's suggestion.
> > > - Added Christian's reviewed-by's where appropriate.
> > >
> > > v1->v2:
> > > - reworked the page fault logic based on Jan's suggestion and turned it into a
> > > helper.
> > > - Added 3 patches per-fs where we need to call the fsnotify helper from their
> > > ->fault handlers.
> > > - Disabled readahead in the case that there's a pre-content watch in place.
> > > - Disabled huge faults when there's a pre-content watch in place (entirely
> > > because it's untested, theoretically it should be straightforward to do).
> > > - Updated the command numbers.
> > > - Addressed the random spelling/grammer mistakes that Jan pointed out.
> > > - Addressed the other random nits from Jan.
> > >
> > > --- Original email ---
> > >
> > > Hello,
> > >
> > > These are the patches for the bare bones pre-content fanotify support. The
> > > majority of this work is Amir's, my contribution to this has solely been around
> > > adding the page fault hooks, testing and validating everything. I'm sending it
> > > because Amir is traveling a bunch, and I touched it last so I'm going to take
> > > all the hate and he can take all the credit.
> > >
> > > There is a PoC that I've been using to validate this work, you can find the git
> > > repo here
> > >
> > > https://github.com/josefbacik/remote-fetch
> > >
> > > This consists of 3 different tools.
> > >
> > > 1. populate. This just creates all the stub files in the directory from the
> > > source directory. Just run ./populate ~/linux ~/hsm-linux and it'll
> > > recursively create all of the stub files and directories.
> > > 2. remote-fetch. This is the actual PoC, you just point it at the source and
> > > destination directory and then you can do whatever. ./remote-fetch ~/linux
> > > ~/hsm-linux.
> > > 3. mmap-validate. This was to validate the pagefault thing, this is likely what
> > > will be turned into the selftest with remote-fetch. It creates a file and
> > > then you can validate the file matches the right pattern with both normal
> > > reads and mmap. Normally I do something like
> > >
> > > ./mmap-validate create ~/src/foo
> > > ./populate ~/src ~/dst
> > > ./rmeote-fetch ~/src ~/dst
> > > ./mmap-validate validate ~/dst/foo
> > >
> > > I did a bunch of testing, I also got some performance numbers. I copied a
> > > kernel tree, and then did remote-fetch, and then make -j4
> > >
> > > Normal
> > > real 9m49.709s
> > > user 28m11.372s
> > > sys 4m57.304s
> > >
> > > HSM
> > > real 10m6.454s
> > > user 29m10.517s
> > > sys 5m2.617s
> > >
> > > So ~17 seconds more to build with HSM. I then did a make mrproper on both trees
> > > to see the size
> > >
> > > [root@fedora ~]# du -hs /src/linux
> > > 1.6G /src/linux
> > > [root@fedora ~]# du -hs dst
> > > 125M dst
> > >
> > > This mirrors the sort of savings we've seen in production.
> > >
> > > Meta has had these patches (minus the page fault patch) deployed in production
> > > for almost a year with our own utility for doing on-demand package fetching.
> > > The savings from this has been pretty significant.
> > >
> > > The page-fault hooks are necessary for the last thing we need, which is
> > > on-demand range fetching of executables. Some of our binaries are several gigs
> > > large, having the ability to remote fetch them on demand is a huge win for us
> > > not only with space savings, but with startup time of containers.
> >
> > So... does this pre-content fetcher already work for regular reads and
> > writes, and now you're looking to wire up page faults? Or does it only
> > handle page faults? Judging from Amir's patches I'm guessing the
> > FAN_PRE_{ACCESS,MODIFY} events are new, so this only sends notifications
> > prior to read and write page faults? The XFS change looks reasonable to
> > me, but I'm left wondering "what does this shiny /do/?"
> >
>
> I *think* I understand the confusion.
>
> Let me try to sort it out.
>
> This patch set collaboration aims to add the functionality of HSM
> service by adding events FS_PRE_{ACCESS,MODIFY} prior to
> read/write and page faults.
>
> Maybe you are puzzled by not seeing any new read/write hooks?
> This is because the HSM events utilize the existing fsnotify_file_*perm()
> hooks that are in place for the legacy FS_ACCESS_PERM event.
>
> So why is a new FS_PRE_ACCESS needed?
> Let me first quote commit message of patch 2/16 [1]:
> ---
> The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM,
> but it meant for a different use case of filling file content before
> access to a file range, so it has slightly different semantics.
>
> Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, same as
> we did for FS_OPEN_PERM/FS_OPEN_EXEC_PERM.
>
> FS_PRE_MODIFY is a new permission event, with similar semantics as
> FS_PRE_ACCESS, which is called before a file is modified.
>
> FS_ACCESS_PERM is reported also on blockdev and pipes, but the new
> pre-content events are only reported for regular files and dirs.
>
> The pre-content events are meant to be used by hierarchical storage
> managers that want to fill the content of files on first access.
> ---
>
> And from my man page draft [2]:
> ---
> FAN_PRE_ACCESS (since Linux 6.x)
> Create an event before a read from a directory or a file range,
> that provides an opportunity for the event listener to
> modify the content of the object
> before the reader is going to access the content in the
> specified range.
> An additional information record of type
> FAN_EVENT_INFO_TYPE_RANGE is returned
> for each event in the read buffer.
> ...
> ---
>
> So the semantics of the two events is slightly different, but also the
> meaning of "an opportunity for the event listener to modify the content".
>
> FS_ACCESS_PERM already provided this opportunity on old kernels,
> but prior to "Tidy up file permission hooks" series [3] in v6.8, writing
> file content in FS_ACCESS_PERM event context was prone to deadlocks.
>
> Therefore, an application using FS_ACCESS_PERM may be prone
> for deadlocks, while an application using FAN_PRE_ACCESS should
> be safe in that regard.
Ah, ok, that's where the missing pieces are. :)
--D
> Thanks,
> Amir.
>
> [1] https://lore.kernel.org/all/a96217d84dfebb15582a04524dc9821ba3ea1406.1723670362.git.josef@toxicpanda.com/
> [2] https://github.com/amir73il/man-pages/commits/fan_pre_path
> [3] https://lore.kernel.org/all/20231122122715.2561213-1-amir73il@gmail.com/
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults
2024-08-29 11:17 ` Jan Kara
@ 2024-08-30 23:28 ` Darrick J. Wong
2024-09-02 10:23 ` Jan Kara
0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2024-08-30 23:28 UTC (permalink / raw)
To: Jan Kara
Cc: Josef Bacik, kernel-team, linux-fsdevel, amir73il, brauner,
linux-xfs, gfs2, linux-bcachefs, Dave Chinner, Christoph Hellwig
On Thu, Aug 29, 2024 at 01:17:53PM +0200, Jan Kara wrote:
> On Wed 14-08-24 17:25:34, Josef Bacik wrote:
> > xfs has it's own handling for write faults, so we need to add the
> > pre-content fsnotify hook for this case. Reads go through filemap_fault
> > so they're handled properly there.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> Looks good to me but it would be great to get explicit ack from some XFS
> guy... Some selection CCed :)
Looks decent to me, but I wonder why xfs_write_fault has to invoke
filemap_maybe_emit_fsnotify_event itself? Can that be done from
whatever calls ->page_mkwrite and friends?
--D
>
> Honza
>
> > ---
> > fs/xfs/xfs_file.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 4cdc54dc9686..e61c4c389d7d 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1283,6 +1283,10 @@ xfs_write_fault(
> > unsigned int lock_mode = XFS_MMAPLOCK_SHARED;
> > vm_fault_t ret;
> >
> > + ret = filemap_maybe_emit_fsnotify_event(vmf);
> > + if (unlikely(ret))
> > + return ret;
> > +
> > sb_start_pagefault(inode->i_sb);
> > file_update_time(vmf->vma->vm_file);
> >
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults
2024-08-30 23:28 ` Darrick J. Wong
@ 2024-09-02 10:23 ` Jan Kara
2024-09-02 11:19 ` Christian Brauner
0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2024-09-02 10:23 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Jan Kara, Josef Bacik, kernel-team, linux-fsdevel, amir73il,
brauner, linux-xfs, gfs2, linux-bcachefs, Dave Chinner,
Christoph Hellwig
On Fri 30-08-24 16:28:33, Darrick J. Wong wrote:
> On Thu, Aug 29, 2024 at 01:17:53PM +0200, Jan Kara wrote:
> > On Wed 14-08-24 17:25:34, Josef Bacik wrote:
> > > xfs has it's own handling for write faults, so we need to add the
> > > pre-content fsnotify hook for this case. Reads go through filemap_fault
> > > so they're handled properly there.
> > >
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> >
> > Looks good to me but it would be great to get explicit ack from some XFS
> > guy... Some selection CCed :)
>
> Looks decent to me, but I wonder why xfs_write_fault has to invoke
> filemap_maybe_emit_fsnotify_event itself? Can that be done from
> whatever calls ->page_mkwrite and friends?
So we were discussing this already here [1]. The options we have:
1) Call filemap_maybe_emit_fsnotify_event() from filesystem hooks
(filemap_fault() for those who use it). This is a bit ugly because it
requires modification to filesystems with their own fault handlers.
2) Call filemap_maybe_emit_fsnotify_event() from generic code before
calling ->fault() and if we needed to send event (and thus dropped
mmap_lock), we will retry the fault. This requires no special fs awareness
but the ->fault hook will then be called after retry most of the times on
HSM managed fs and thus without possibility to drop mmap_lock making
contention there possibly worse.
3) (I don't think we've discussed this option yet): Call
filemap_maybe_emit_fsnotify_event() in generic code before calling ->fault
and then continue to call ->fault even if we've dropped mmap_lock. This
will require changing calling convention for ->fault as vmf->vma must not
be touched after we've dropped mmap_lock and practically all users end up
using it to get vmf->vma->vm_file. With per-fs opt in flag to enable HSM
events this could be manageable but frankly I'm not convinced the
complicated calling convention would be better outcome than 1). But I'm
open for discussion.
Honza
[1] https://lore.kernel.org/all/CAOQ4uxgXEzT=Buwu8SOkQG+2qcObmdH4NgsGme8bECObiobfTQ@mail.gmail.com
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults
2024-09-02 10:23 ` Jan Kara
@ 2024-09-02 11:19 ` Christian Brauner
0 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2024-09-02 11:19 UTC (permalink / raw)
To: Jan Kara
Cc: Darrick J. Wong, Josef Bacik, kernel-team, linux-fsdevel,
amir73il, linux-xfs, gfs2, linux-bcachefs, Dave Chinner,
Christoph Hellwig
On Mon, Sep 02, 2024 at 12:23:44PM GMT, Jan Kara wrote:
> On Fri 30-08-24 16:28:33, Darrick J. Wong wrote:
> > On Thu, Aug 29, 2024 at 01:17:53PM +0200, Jan Kara wrote:
> > > On Wed 14-08-24 17:25:34, Josef Bacik wrote:
> > > > xfs has it's own handling for write faults, so we need to add the
> > > > pre-content fsnotify hook for this case. Reads go through filemap_fault
> > > > so they're handled properly there.
> > > >
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > >
> > > Looks good to me but it would be great to get explicit ack from some XFS
> > > guy... Some selection CCed :)
> >
> > Looks decent to me, but I wonder why xfs_write_fault has to invoke
> > filemap_maybe_emit_fsnotify_event itself? Can that be done from
> > whatever calls ->page_mkwrite and friends?
>
> So we were discussing this already here [1]. The options we have:
>
> 1) Call filemap_maybe_emit_fsnotify_event() from filesystem hooks
Sidenote: Can that be renamed to filemap_fsnotify() or something
similar. Especially that "maybe" in there really doesn't add value imho.
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2024-09-02 11:19 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-14 21:25 [PATCH v4 00/16] fanotify: add pre-content hooks Josef Bacik
2024-08-14 21:25 ` [PATCH v4 01/16] fanotify: don't skip extra event info if no info_mode is set Josef Bacik
2024-08-14 21:25 ` [PATCH v4 02/16] fsnotify: introduce pre-content permission event Josef Bacik
2024-08-14 21:25 ` [PATCH v4 03/16] fsnotify: generate pre-content permission event on open Josef Bacik
2024-08-14 21:25 ` [PATCH v4 04/16] fanotify: introduce FAN_PRE_ACCESS permission event Josef Bacik
2024-08-14 21:25 ` [PATCH v4 05/16] fanotify: introduce FAN_PRE_MODIFY " Josef Bacik
2024-08-14 21:25 ` [PATCH v4 06/16] fanotify: pass optional file access range in pre-content event Josef Bacik
2024-08-14 21:25 ` [PATCH v4 07/16] fanotify: rename a misnamed constant Josef Bacik
2024-08-14 21:25 ` [PATCH v4 08/16] fanotify: report file range info with pre-content events Josef Bacik
2024-08-29 10:13 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 09/16] fanotify: allow to set errno in FAN_DENY permission response Josef Bacik
2024-08-29 10:25 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 10/16] fanotify: add a helper to check for pre content events Josef Bacik
2024-08-14 21:25 ` [PATCH v4 11/16] fanotify: disable readahead if we have pre-content watches Josef Bacik
2024-08-29 10:48 ` Jan Kara
2024-08-29 12:44 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 12/16] mm: don't allow huge faults for files with pre content watches Josef Bacik
2024-08-29 10:51 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 13/16] fsnotify: generate pre-content permission event on page fault Josef Bacik
2024-08-29 11:07 ` Jan Kara
2024-08-14 21:25 ` [PATCH v4 14/16] bcachefs: add pre-content fsnotify hook to fault Josef Bacik
2024-08-29 11:10 ` Jan Kara
2024-08-29 11:26 ` Kent Overstreet
2024-08-29 12:46 ` Josef Bacik
2024-08-29 12:55 ` Kent Overstreet
2024-08-29 12:25 ` Kent Overstreet
2024-08-14 21:25 ` [PATCH v4 15/16] gfs2: " Josef Bacik
2024-08-29 11:15 ` Jan Kara
2024-08-29 11:26 ` Amir Goldstein
2024-08-29 11:43 ` Jan Kara
2024-08-29 12:49 ` Amir Goldstein
2024-08-29 12:42 ` Josef Bacik
2024-08-14 21:25 ` [PATCH v4 16/16] xfs: add pre-content fsnotify hook for write faults Josef Bacik
2024-08-29 11:17 ` Jan Kara
2024-08-30 23:28 ` Darrick J. Wong
2024-09-02 10:23 ` Jan Kara
2024-09-02 11:19 ` Christian Brauner
2024-08-29 21:41 ` [PATCH v4 00/16] fanotify: add pre-content hooks Darrick J. Wong
2024-08-30 8:55 ` Amir Goldstein
2024-08-30 23:22 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).