[RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV

Linux-Fsdevel Archive mirror
 help / color / mirror / Atom feed

* [RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV
@ 2024-02-25 22:06 Al Viro
  2024-02-26  6:35 ` Vegard Nossum
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Al Viro @ 2024-02-25 22:06 UTC (permalink / raw
  To: linux-fsdevel; +Cc: linux-doc, Linus Torvalds

	The text below is a rough approximation to what should, IMO,
end up in Documentation/filesystems/rcu-exposure.rst.  It started
as a part of audit notes.  Intended target audience of the text is
filesystem maintainers and/or folks who are reviewing fs code;
I hope the current contents is already useful to an extent, but
I'm pretty certain that it needs more massage before it could
go into the tree.

	Please, read and review - comments and suggestions would be
very welcome, both for contents and for markup - I'm *not* familiar
with ReST or the ways it's used in the kernel (in particular, footnotes
seem to get lost in PDF).

===================================================================
The ways in which RCU pathwalk can ruin filesystem maintainer's day
===================================================================

The problem: exposure of filesystem code to lockless environment
================================================================

Filesystem methods can usually count upon VFS-provided warranties
regarding the stability of objects they are called to act upon; at the
very least, they can expect the dentries/inodes/superblocks involved to
remain live throughout the operation.

Life would be much more painful without that; however, such warranties
do not come for free.  The problem is that access patterns are heavily
biased; every system call getting an absolute pathname will have to
start at root directory, etc.  Having each of them in effect write
"I'd been here" on the same memory objects would cost quite a bit.
As the result, we try to keep the fast path stores-free, bumping
no refcounts and taking no locks.  Details are described elsewhere,
but the bottom line for filesystems is that some methods may be called
with much looser warranties than usual.  Of course, from the filesystem
POV each of those is a potential source of headache - you are asked to
operate on an object that might start to be torn down right under you,
possibly along with the filesystem instance it lives on.

The last possibility might sound surprising, but a lazy pathwalk really
can run into a dentry on a filesystem that is getting shut down.

Here's one of the scenarios for that to happen:

	1. have two threads sharing fs_struct chdir'ed on that filesystem.
	2. lazy-umount it, so that the only thing holding it alive is
	   cwd of these threads.
	3. the first thread does relative pathname resolution
	   and gets to e.g. ->d_hash().  It's holding rcu_read_lock().
	4. at the same time the second thread does fchdir(), moving to
	   different directory.

In fchdir(2) we get to set_fs_pwd(), which set the current directory
to the new place and does mntput() on the old one.  No RCU delays here,
we calculate the refcount of that mount and see that we are dropping
the last reference.  We make sure that the pathwalk in progress in
the first thread will fail when it comes to legitimize_mnt() and do this
(in mntput_no_expire())::

	init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
	if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
		return;

As we leave the syscall, we have __cleanup_mnt() run; it calls cleanup_mnt()
on our mount, which hits deactivate_super().  That was the last reference to
superblock.

Voila - we have a filesystem shutdown right under the nose of a thread
running in ->d_hash() of something on that filesystem.  Mutatis mutandis,
one can arrange the same for other methods called by rcu pathwalk.

It's not easy to hit (especially if you want to get through the
entire ->kill_sb() before the first thread gets through ->d_hash()),
and it's probably impossible on the real hardware; on KVM it might be
borderline doable.  However, it is possible and I would not swear that
other ways of arranging the same thing are equally hard to hit.

The bottom line: methods that can be called in RCU mode need to
be careful about the per-superblock objects destruction.

Opting out
==========

To large extent a filesystem can opt out of RCU pathwalk; that loses all
scalability benefits whenever you filesystem gets involved in pathname
resolution, though.  If that's the way you choose to go, just make sure
that

1. any ->d_revalidate(), ->permission(), ->get_link() and ->get_inode_acl()
instance bails out if called by RCU pathwalk (see below for details).
Costs a couple of lines of boilerplate in each.

2. if some symlink inodes have ->i_link set to a dynamically allocated
object, that object won't be freed without an RCU delay.  Anything
coallocated with inode is fine, so's anything freed from ->free_inode().
Usually comes for free, just remember to avoid freeing directly
from ->destroy_inode().

3. any ->d_hash() and ->d_compare() instances (if you have those) do
not access any filesystem objects.

4. there's no ->d_automount() instances in your filesystem.

If your case does not fit the above, the easy opt-out is not for you.
If so, you'll have to keep reading...

What methods are affected?
==========================

	The list of the methods that could run into that fun:

========================	==================================	=================
	method			indication that the call is unsafe	unstable objects
========================	==================================	=================
->d_hash(d, ...) 		none - any call might be		d
->d_compare(d, ...)		none - any call might be		d
->d_revalidate(d, f)		f & LOOKUP_RCU				d
->d_manage(d, f)		f					d
->permission(i, m)		m & MAY_NOT_BLOCK			i
->get_link(d, i, ...)		d == NULL				i
->get_inode_acl(i, t, f)	f == LOOKUP_RCU				i
========================	==================================	=================

Additionally, callback set by set_delayed_call() from unsafe call of
->get_link() will be run in the same environment; that one is usually not
a problem, though.

For the sake of completeness, three of LSM methods
(->inode_permission(), ->inode_follow_link() and ->task_to_inode())
might be called in similar environment, but that's a problem for LSM
crowd, not for filesystem folks.

Any method call is, of course, required not to crash - no stepping on
freed memory, etc.  All of the unsafe calls listed above are done under
rcu_read_lock(), so they are not allowed to block.  Further requirements
vary between the methods.

Before going through the list of affected methods, several notes on
the things that _are_ guaranteed:

* if a reference to struct dentry is passed to such call, it will
  not be freed until the method returns.  The same goes for a reference to
  struct inode and to struct super_block pointed to by ->d_sb or ->i_sb
  members of dentry and inode resp.  Any of those might be in process of
  being torn down or enter such state right under us; the entire point of
  those unsafe calls is that we make them without telling anyone they'd
  need to wait for us.

* following ->d_parent and ->d_inode of such dentries is fine,
  provided that it's done by READ_ONCE() (for ->d_inode the preferred
  form is d_inode_rcu(dentry)).  The value of ->d_parent is never going
  to be NULL and it will again point to a struct dentry that will not be
  freed until the method call finishes.  The value of ->d_inode might
  be NULL; if non-NULL, it'll be pointing to a struct inode that will
  not be freed until the method call finishes.

* none of the inodes passed to an unsafe call could have reached
  fs/inode.c:evict() before the caller grabbed rcu_read_lock().

* for inodes 'not freed' means 'not entered ->free_inode()', so
  anything that won't be destroyed until ->free_inode() is safe to access.
  Anything synchronously destroyed in ->evict_inode() or ->destroy_inode()
  is not safe; however, one can count upon the call_rcu() callbacks
  issued in those yet to be entered.  Note that unlike dentries and
  superblocks, inodes are embedded into filesystem-private objects;
  anything stored directly in the containing object is safe to access.

* for dentries anything destroyed by ->d_prune() (synchronously or
  not) is not safe; the same goes for the things synchronously destroyed
  by ->d_release().  However, call_rcu() callbacks issued in ->d_release()
  are yet to be entered.

* for superblocks we can count upon call_rcu() callbacks issued
  from inside the ->kill_sb() (including the ones issued from
  ->put_super()) yet to be entered.  You can also count upon
  ->s_user_ns still being pinned and ->s_security still not
  freed.

* NOTE: we **can not** count upon the things like ->d_parent
  being positive (or a directory); a race with rename()+rmdir()+mknod()
  and you might find a FIFO as parent's inode.  NULL is even easier -
  just have the dentry and its ex-parent already past dentry_kill()
  (which is a normal situation for eviction on memory pressure) and there
  you go.  Normally such pathologies are prevented by the locking (and
  dentry refcounting), but... the entire point of that stuff is to avoid
  informing anyone that we are there, so those mechanisms are bypassed.
  What's more, if dentry is not pinned by refcount, grabbing its ->d_lock
  will *not* suffice to prevent that kind of mess - the scenario with
  eviction by memory pressure won't be prevented by that; you might have
  grabbed ->d_lock only after the dentry_kill() had released it, and
  at that point ->d_parent still points to what used to be the parent,
  but there's nothing to prevent its eviction.

->d_compare()
-------------

For ->d_compare() we just need to make sure it won't crash
when called for dying dentry - an incorrect return value won't harm the
caller in such case.  False positives and false negatives alike - the
callers take care of that.  To be pedantic, make that "false positives
do not cause problems unless they have ->d_manage()", but ->d_manage()
is present only on autofs and there's no autofs ->d_compare() instances.
[#f1]_

There is no indication that ->d_compare() is called in RCU mode;
the majority of callers are such, anyway, so we need to cope with that.
VFS guarantees that dentry won't be freed under us; the same goes for
the superblock pointed to by its ->d_sb.  Name points to memory object
that won't get freed under us and length does not exceed the size of
that object.  The contents of that object is *NOT* guaranteed to be
stable; d_move() might race with us, modifying the name.  However, in
that case we are free to return an arbitrary result - the callers will
take care of both false positives and false negatives in such case.
The name we are comparing dentry with (passed in qstr) is stable,
thankfully...

If we need to access any other data, it's up to the filesystem
to protect it.  In practice it means that destruction of fs-private part
of superblock (and possibly unicode tables hanging off it, etc.) might
need to be RCU-delayed.

*IF* you want the behaviour that varies depending upon the parent
directory, you get to be very careful with READ_ONCE() and watch out
for the object lifetimes.

Basically, if the things get that tricky, ask for help.
Currently there are two such instances in the tree - proc_sys_compare()
and generic_ci_d_compare().  Both are... special.

->d_hash()
----------

For ->d_hash() on a dying dentry we are free to report any hash
value; the only extra requirement is that we should not return stray
hard errors.  In other words, if we return anything other than 0 or
-ECHILD, we'd better make sure that this error would've been correct
before the parent started dying.  Since ->d_hash() error-reporting is
usually done to reject unacceptable names (too long, contain unsuitable
characters for this filesystem, etc.), that's really not a problem -
hard errors depend only upon the name, not the parent.

Again, VFS guarantees that freeing of dentry and of the superblock
pointed to by dentry->d_sb won't happen under us.  The name passed to
us (in qstr) is stable.  If you need anything beyond that, you are
in the same situation as with ->d_compare().  Might want to RCU-delay
freeing private part of superblock (if that's what we need to access),
might want the same for some objects hanging off that (unicode tables,
etc.).  If you need something beyond that - ask for help.

->d_revalidate()
----------------

For this one we do have an indication of call being unsafe -
flags & LOOKUP_RCU.  With ->d_revalidate we are always allowed to bail
out and return -ECHILD; that will have the caller drop out of RCU mode.
We definitely need to do that if revalidate would require any kind of IO,
mutex-taking, etc.; we can't block in RCU mode.

Quite a few instances of ->d_revalidate() simply treat LOOKUP_RCU
in flags as "return -ECHILD and be done with that"; it's guaranteed to
do the right thing, but you lose the benefits of RCU pathwalks whenever
you run into such dentry.

Same as with the previous methods, we are guaranteed that
dentry and dentry->d_sb won't be freed under us.  We are also guaranteed
that ->d_parent (which is *not* stable, so use READ_ONCE) points to a
struct dentry that won't get freed under us.  As always with ->d_parent,
it's not NULL - for a detached dentry it will point to dentry itself.
d_inode_rcu() of dentry and its parent will be either NULL or will
point to a struct inode that won't get freed under us.	Anything beyond
than that is not guaranteed.  We may find parent to be negative - it can
happen if we race with d_move() and removal of old parent.  In that case
just return -ECHILD and be done with that.

On non-RCU side you could use dget_parent() instead - that
would give a positive dentry and its ->d_inode would remain stable.
dget_parent() has to be paired with dput(), though, so it's not usable
in RCU mode.

If you need fs-private objects associated with dentry, its parent
inode(s) or superblock - see the general notes above on how to access
those.

->d_manage()
------------

Can be called in RCU mode; gets an argument telling it if it has
been called so.  Pretty much autofs-only; for everyone's sanity sake,
don't inflict more of those on the kernel.  Definitely don't do that
without asking first...

->permission()
--------------

Can be called in RCU mode; that is indicated by MAY_NOT_BLOCK
in mask, and it can only happen for MAY_EXEC checks on directories.
In RCU mode it is not allowed to block, and it is allowed to bail out
by returning -ECHILD.  It might be called for an inode that is getting
torn down, possibly along with its filesystem.	Errors other than -ECHILD
should only be returned if they would've been returned in non-RCU mode;
several instances in procfs currently (6.5) run afoul of that one.  That's
an instructive example, BTW - what happens is that proc_pid_permission()
uses proc_get_task() to find the relevant process.  proc_get_task()
uses PID reference stored in struct proc_inode our inode is embedded
into; inode can't have been freed yet, so fetching ->pid member in that
is safe.  However, using the value you've fetched is a different story
- proc_evict_inode() would have passed it to put_pid() and replaced
it with NULL.  Unsafe caller has no way to tell if that is happening
right under it.  Solution: stop zeroing ->pid in proc_evict_inode()
and move put_pid() from proc_pid_evict_inode() to proc_free_inode().
That's not all that is needed (there's access to procfs-private part of
superblock as well), but it does make a good example of how such stuff
can be dealt with.

Note that idmap argument is safe on all calls - its destruction
is rcu-delayed.

The amount of headache is seriously reduced (for now) by the fact
that a lot of instances boil down to generic_permission() (which will
do the right thing in RCU mode) when mask is MAY_EXEC | MAY_NOT_BLOCK.
If we ever extend RCU mode to other ->permission() callers, the thing will
get interesting; that's not likely to happen, though, unless access(2)
goes there [this is NOT a suggestion, folks].

->get_link()
------------

Again, this can be called in RCU mode.	Even if your ->d_revalidate()
always returns -ECHILD in RCU mode and kicks the pathwalk out of it,
you can't assume that ->get_link() won't be reached [#f2]_.
NULL dentry argument is an indicator of unsafe call; if you can't handle
it, just return ERR_PTR(-ECHILD).  Any allocations you need to do (and
with this method you really might need that) should be done with GFP_ATOMIC
in the unsafe case.

Whatever you pass to set_delayed_call() is going to be called
in the same mode as ->get_link() itself; not a problem for most of the
instances.  The string you return needs to stay there until the
callback gets called or, if no callback is set, until at least the
freeing of inode.  As usual, for an unsafe call the inode might be
in process of teardown, possibly along with the hosting filesystem.
The usual considerations apply.  The same, BTW, applies to whatever
you set in ->i_link - it must stay around at least until ->free_inode().

->get_inode_acl()
-----------------

Very limited exposure for that one - unsafe call is possible
only if you explicitly set ACL_DONT_CACHE as cached ACL value.
Only two filesystems (fuse and overlayfs) even bother.  Unsafe call
is indicated by explicit flag (the third argument of the method),
bailout is done by returning ERR_PTR(-CHILD) and the usual considerations
apply for any access to data structures you might need to do.

.. rubric:: Footnotes

.. [#f1]

Some callers prevent being called for dying dentry (holding ->d_lock and
having verified !d_unhashed() or finding it in the list of inode's aliases
under ->i_lock).  For those the scenario in question simply cannot arise.

Some follow the match with lockref_get_not_dead() and treat the failure
as mismatch.  That takes care of false positives, and false negatives on
dying dentry are still correct - we simply pretend to have lost the race.

The only caller that does not fit into the classes above is
__d_lookup_rcu_op_compare().  There we sample ->d_seq and verify
!d_unhashed() before calling ->d_compare().  That is not enough to
prevent dentry from starting to die right under us; however, the sampled
value of ->d_seq will be rechecked when the caller gets to step_into(),
so for a false positive we will end up with a mismatch.  The corner case
around ->d_manage() is due to the handle_mounts() done before step_into()
gets to ->d_seq validation...

.. [#f2]

binding a symlink on top of a regular file on another filesystem is possible
and that's all it takes for RCU pathwalk to get there.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV
  2024-02-25 22:06 [RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV Al Viro
@ 2024-02-26  6:35 ` Vegard Nossum
  2024-02-26  9:13   ` Al Viro
  2024-02-26  7:58 ` Randy Dunlap
  2024-02-26 18:35 ` [RFC][v2] " Al Viro
  2 siblings, 1 reply; 8+ messages in thread
From: Vegard Nossum @ 2024-02-26  6:35 UTC (permalink / raw
  To: Al Viro, linux-fsdevel; +Cc: linux-doc, Linus Torvalds

Hi,

A couple of clarity-related suggestions:

On 25/02/2024 23:06, Al Viro wrote:
> ===================================================================
> The ways in which RCU pathwalk can ruin filesystem maintainer's day
> ===================================================================
> 
> The problem: exposure of filesystem code to lockless environment
> ================================================================
> 
> Filesystem methods can usually count upon VFS-provided warranties
> regarding the stability of objects they are called to act upon; at the
> very least, they can expect the dentries/inodes/superblocks involved to
> remain live throughout the operation.
> 
> Life would be much more painful without that; however, such warranties
> do not come for free.  The problem is that access patterns are heavily
> biased; every system call getting an absolute pathname will have to
> start at root directory, etc.  Having each of them in effect write
> "I'd been here" on the same memory objects would cost quite a bit.
> As the result, we try to keep the fast path stores-free, bumping
> no refcounts and taking no locks.  Details are described elsewhere,
> but the bottom line for filesystems is that some methods may be called
> with much looser warranties than usual.  Of course, from the filesystem

There is a slight apparent contradiction here between "at the very
least, they can expect [...] to remain live throughout the operation" in
the first paragraph (which sounds like they _do_ have these guarantees)
and most of the second paragraph (which says they _don't_ have these
guarantees).

I *think* what you are saying is that dentries/inodes/sbs involved will
indeed stay live (i.e. allocated), but that there are OTHER warranties
you might usually expect that are not there, such as objects not being
locked and potentially changing underneath your filesystem's VFS
callback or being in a partial state or other indirectly pointed-to
objects not being safe to access.

The source of the confusion for me is that you say "such warranties do
not come for free" and it sounds like it refers to the liveness warranty
that you just mentioned, whereas it actually refers to all the other
warranties that you did NOT mention yet at this point in the document.

How about changing it like this:

"""
Filesystem methods can usually count upon a number of VFS-provided
warranties regarding the stability of the dentries/inodes/superblocks
they are called to act upon. For example, they always can expect these
objects to remain live throughout the operation; life would be much more
painful without that.

However, such warranties do not come for free and other warranties may
not always be provided. [...]
"""

(As a side note, you may also want to actually link the docs we have for
RCU lookup where you say "details are described elsewhere".)

> What methods are affected?
> ==========================
> 
> 	The list of the methods that could run into that fun:
> 
> ========================	==================================	=================
> 	method			indication that the call is unsafe	unstable objects
> ========================	==================================	=================

I'd wish for explicit definitions of "unsafe" (which is a terminology
you do use more or less consistently in this doc) and "unstable". The
definitions don't need mathematical precision, but there should be a
quick one-line explanation of each.

I think "the call is unsafe" means that it doesn't have all the usual
safety warranties (as detailed above).

I think "unstable" means "not locked, can change underneath the
function" (but not that it can be freed), but it would be good to have
it spelled out.

> ->d_hash(d, ...) 		none - any call might be		d
> ->d_compare(d, ...)		none - any call might be		d
> ->d_revalidate(d, f)		f & LOOKUP_RCU				d
> ->d_manage(d, f)		f					d
> ->permission(i, m)		m & MAY_NOT_BLOCK			i
> ->get_link(d, i, ...)		d == NULL				i
> ->get_inode_acl(i, t, f)	f == LOOKUP_RCU				i
> ========================	==================================	=================

FWIW, thanks for writing this document, it's a really useful addition.

Vegard

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV
  2024-02-25 22:06 [RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV Al Viro
  2024-02-26  6:35 ` Vegard Nossum
@ 2024-02-26  7:58 ` Randy Dunlap
  2024-02-26 18:35 ` [RFC][v2] " Al Viro
  2 siblings, 0 replies; 8+ messages in thread
From: Randy Dunlap @ 2024-02-26  7:58 UTC (permalink / raw
  To: Al Viro, linux-fsdevel; +Cc: linux-doc, Linus Torvalds

Hi Al,

Just some editing suggestions below...


On 2/25/24 14:06, Al Viro wrote:
> 	The text below is a rough approximation to what should, IMO,
> end up in Documentation/filesystems/rcu-exposure.rst.  It started
> as a part of audit notes.  Intended target audience of the text is
> filesystem maintainers and/or folks who are reviewing fs code;
> I hope the current contents is already useful to an extent, but
> I'm pretty certain that it needs more massage before it could
> go into the tree.
> 
> 	Please, read and review - comments and suggestions would be
> very welcome, both for contents and for markup - I'm *not* familiar
> with ReST or the ways it's used in the kernel (in particular, footnotes
> seem to get lost in PDF).
> 
> 
> ===================================================================
> The ways in which RCU pathwalk can ruin filesystem maintainer's day
> ===================================================================
> 
> The problem: exposure of filesystem code to lockless environment
> ================================================================
> 
> Filesystem methods can usually count upon VFS-provided warranties
> regarding the stability of objects they are called to act upon; at the
> very least, they can expect the dentries/inodes/superblocks involved to
> remain live throughout the operation.
> 
> Life would be much more painful without that; however, such warranties
> do not come for free.  The problem is that access patterns are heavily
> biased; every system call getting an absolute pathname will have to
> start at root directory, etc.  Having each of them in effect write
> "I'd been here" on the same memory objects would cost quite a bit.
> As the result, we try to keep the fast path stores-free, bumping
> no refcounts and taking no locks.  Details are described elsewhere,
> but the bottom line for filesystems is that some methods may be called
> with much looser warranties than usual.  Of course, from the filesystem
> POV each of those is a potential source of headache - you are asked to
> operate on an object that might start to be torn down right under you,
> possibly along with the filesystem instance it lives on.
> 
> The last possibility might sound surprising, but a lazy pathwalk really
> can run into a dentry on a filesystem that is getting shut down.
> 
> Here's one of the scenarios for that to happen:
> 
> 	1. have two threads sharing fs_struct chdir'ed on that filesystem.
> 	2. lazy-umount it, so that the only thing holding it alive is
> 	   cwd of these threads.
> 	3. the first thread does relative pathname resolution
> 	   and gets to e.g. ->d_hash().  It's holding rcu_read_lock().
> 	4. at the same time the second thread does fchdir(), moving to
> 	   different directory.
> 
> In fchdir(2) we get to set_fs_pwd(), which set the current directory
> to the new place and does mntput() on the old one.  No RCU delays here,
> we calculate the refcount of that mount and see that we are dropping
> the last reference.  We make sure that the pathwalk in progress in
> the first thread will fail when it comes to legitimize_mnt() and do this
> (in mntput_no_expire())::
> 
> 	init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> 	if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> 		return;
> 
> As we leave the syscall, we have __cleanup_mnt() run; it calls cleanup_mnt()
> on our mount, which hits deactivate_super().  That was the last reference to
> superblock.
> 
> Voila - we have a filesystem shutdown right under the nose of a thread
> running in ->d_hash() of something on that filesystem.  Mutatis mutandis,
> one can arrange the same for other methods called by rcu pathwalk.
> 
> It's not easy to hit (especially if you want to get through the
> entire ->kill_sb() before the first thread gets through ->d_hash()),
> and it's probably impossible on the real hardware; on KVM it might be
> borderline doable.  However, it is possible and I would not swear that
> other ways of arranging the same thing are equally hard to hit.
> 
> The bottom line: methods that can be called in RCU mode need to
> be careful about the per-superblock objects destruction.
> 
> 
> Opting out
> ==========
> 
> To large extent a filesystem can opt out of RCU pathwalk; that loses all
> scalability benefits whenever you filesystem gets involved in pathname

                                your

> resolution, though.  If that's the way you choose to go, just make sure
> that
> 
> 1. any ->d_revalidate(), ->permission(), ->get_link() and ->get_inode_acl()
> instance bails out if called by RCU pathwalk (see below for details).
> Costs a couple of lines of boilerplate in each.
> 
> 2. if some symlink inodes have ->i_link set to a dynamically allocated
> object, that object won't be freed without an RCU delay.  Anything
> coallocated with inode is fine, so's anything freed from ->free_inode().
> Usually comes for free, just remember to avoid freeing directly
> from ->destroy_inode().
> 
> 3. any ->d_hash() and ->d_compare() instances (if you have those) do
> not access any filesystem objects.
> 
> 4. there's no ->d_automount() instances in your filesystem.
> 
> If your case does not fit the above, the easy opt-out is not for you.
> If so, you'll have to keep reading...
> 
> 
> What methods are affected?
> ==========================
> 
> 	The list of the methods that could run into that fun:
> 
> ========================	==================================	=================
> 	method			indication that the call is unsafe	unstable objects
> ========================	==================================	=================
> ->d_hash(d, ...) 		none - any call might be		d
> ->d_compare(d, ...)		none - any call might be		d
> ->d_revalidate(d, f)		f & LOOKUP_RCU				d
> ->d_manage(d, f)		f					d
> ->permission(i, m)		m & MAY_NOT_BLOCK			i
> ->get_link(d, i, ...)		d == NULL				i
> ->get_inode_acl(i, t, f)	f == LOOKUP_RCU				i
> ========================	==================================	=================
> 
> 
> Additionally, callback set by set_delayed_call() from unsafe call of
> ->get_link() will be run in the same environment; that one is usually not
> a problem, though.
> 
> For the sake of completeness, three of LSM methods
> (->inode_permission(), ->inode_follow_link() and ->task_to_inode())
> might be called in similar environment, but that's a problem for LSM
> crowd, not for filesystem folks.
> 
> 
> Any method call is, of course, required not to crash - no stepping on
> freed memory, etc.  All of the unsafe calls listed above are done under
> rcu_read_lock(), so they are not allowed to block.  Further requirements
> vary between the methods.
> 
> 
> Before going through the list of affected methods, several notes on
> the things that _are_ guaranteed:
> 
> * if a reference to struct dentry is passed to such call, it will
>   not be freed until the method returns.  The same goes for a reference to
>   struct inode and to struct super_block pointed to by ->d_sb or ->i_sb
>   members of dentry and inode resp.  Any of those might be in process of

please spell out                respectively.

>   being torn down or enter such state right under us; the entire point of
>   those unsafe calls is that we make them without telling anyone they'd
>   need to wait for us.
> 
> * following ->d_parent and ->d_inode of such dentries is fine,
>   provided that it's done by READ_ONCE() (for ->d_inode the preferred
>   form is d_inode_rcu(dentry)).  The value of ->d_parent is never going
>   to be NULL and it will again point to a struct dentry that will not be
>   freed until the method call finishes.  The value of ->d_inode might
>   be NULL; if non-NULL, it'll be pointing to a struct inode that will
>   not be freed until the method call finishes.
> 
> * none of the inodes passed to an unsafe call could have reached
>   fs/inode.c:evict() before the caller grabbed rcu_read_lock().
> 
> * for inodes 'not freed' means 'not entered ->free_inode()', so
>   anything that won't be destroyed until ->free_inode() is safe to access.
>   Anything synchronously destroyed in ->evict_inode() or ->destroy_inode()
>   is not safe; however, one can count upon the call_rcu() callbacks
>   issued in those yet to be entered.  Note that unlike dentries and
>   superblocks, inodes are embedded into filesystem-private objects;
>   anything stored directly in the containing object is safe to access.
> 
> * for dentries anything destroyed by ->d_prune() (synchronously or
>   not) is not safe; the same goes for the things synchronously destroyed
>   by ->d_release().  However, call_rcu() callbacks issued in ->d_release()
>   are yet to be entered.
> 
> * for superblocks we can count upon call_rcu() callbacks issued
>   from inside the ->kill_sb() (including the ones issued from
>   ->put_super()) yet to be entered.  You can also count upon
>   ->s_user_ns still being pinned and ->s_security still not
>   freed.
> 
> * NOTE: we **can not** count upon the things like ->d_parent
>   being positive (or a directory); a race with rename()+rmdir()+mknod()
>   and you might find a FIFO as parent's inode.  NULL is even easier -
>   just have the dentry and its ex-parent already past dentry_kill()
>   (which is a normal situation for eviction on memory pressure) and there
>   you go.  Normally such pathologies are prevented by the locking (and
>   dentry refcounting), but... the entire point of that stuff is to avoid
>   informing anyone that we are there, so those mechanisms are bypassed.
>   What's more, if dentry is not pinned by refcount, grabbing its ->d_lock
>   will *not* suffice to prevent that kind of mess - the scenario with
>   eviction by memory pressure won't be prevented by that; you might have
>   grabbed ->d_lock only after the dentry_kill() had released it, and
>   at that point ->d_parent still points to what used to be the parent,
>   but there's nothing to prevent its eviction.
> 
> 
> ->d_compare()
> -------------
> 
> For ->d_compare() we just need to make sure it won't crash
> when called for dying dentry - an incorrect return value won't harm the
> caller in such case.  False positives and false negatives alike - the
> callers take care of that.  To be pedantic, make that "false positives
> do not cause problems unless they have ->d_manage()", but ->d_manage()
> is present only on autofs and there's no autofs ->d_compare() instances.
> [#f1]_
> 
> There is no indication that ->d_compare() is called in RCU mode;
> the majority of callers are such, anyway, so we need to cope with that.
> VFS guarantees that dentry won't be freed under us; the same goes for
> the superblock pointed to by its ->d_sb.  Name points to memory object
> that won't get freed under us and length does not exceed the size of
> that object.  The contents of that object is *NOT* guaranteed to be
> stable; d_move() might race with us, modifying the name.  However, in
> that case we are free to return an arbitrary result - the callers will
> take care of both false positives and false negatives in such case.
> The name we are comparing dentry with (passed in qstr) is stable,
> thankfully...
> 
> If we need to access any other data, it's up to the filesystem
> to protect it.  In practice it means that destruction of fs-private part
> of superblock (and possibly unicode tables hanging off it, etc.) might
> need to be RCU-delayed.
> 
> *IF* you want the behaviour that varies depending upon the parent
> directory, you get to be very careful with READ_ONCE() and watch out
> for the object lifetimes.
> 
> Basically, if the things get that tricky, ask for help.
> Currently there are two such instances in the tree - proc_sys_compare()
> and generic_ci_d_compare().  Both are... special.
> 
> 
> ->d_hash()
> ----------
> 
> For ->d_hash() on a dying dentry we are free to report any hash
> value; the only extra requirement is that we should not return stray
> hard errors.  In other words, if we return anything other than 0 or
> -ECHILD, we'd better make sure that this error would've been correct
> before the parent started dying.  Since ->d_hash() error-reporting is

no hyphen IMO                                             ^

> usually done to reject unacceptable names (too long, contain unsuitable
> characters for this filesystem, etc.), that's really not a problem -
> hard errors depend only upon the name, not the parent.
> 
> Again, VFS guarantees that freeing of dentry and of the superblock
> pointed to by dentry->d_sb won't happen under us.  The name passed to
> us (in qstr) is stable.  If you need anything beyond that, you are
> in the same situation as with ->d_compare().  Might want to RCU-delay
> freeing private part of superblock (if that's what we need to access),
> might want the same for some objects hanging off that (unicode tables,
> etc.).  If you need something beyond that - ask for help.
> 
> 
> ->d_revalidate()
> ----------------
> 
> For this one we do have an indication of call being unsafe -
> flags & LOOKUP_RCU.  With ->d_revalidate we are always allowed to bail
> out and return -ECHILD; that will have the caller drop out of RCU mode.
> We definitely need to do that if revalidate would require any kind of IO,
> mutex-taking, etc.; we can't block in RCU mode.
> 
> Quite a few instances of ->d_revalidate() simply treat LOOKUP_RCU
> in flags as "return -ECHILD and be done with that"; it's guaranteed to
> do the right thing, but you lose the benefits of RCU pathwalks whenever
> you run into such dentry.
> 
> Same as with the previous methods, we are guaranteed that
> dentry and dentry->d_sb won't be freed under us.  We are also guaranteed
> that ->d_parent (which is *not* stable, so use READ_ONCE) points to a
> struct dentry that won't get freed under us.  As always with ->d_parent,
> it's not NULL - for a detached dentry it will point to dentry itself.
> d_inode_rcu() of dentry and its parent will be either NULL or will
> point to a struct inode that won't get freed under us.	Anything beyond

s/tab/space/

> than that is not guaranteed.  We may find parent to be negative - it can
> happen if we race with d_move() and removal of old parent.  In that case
> just return -ECHILD and be done with that.
> 
> On non-RCU side you could use dget_parent() instead - that
> would give a positive dentry and its ->d_inode would remain stable.
> dget_parent() has to be paired with dput(), though, so it's not usable
> in RCU mode.
> 
> If you need fs-private objects associated with dentry, its parent
> inode(s) or superblock - see the general notes above on how to access
> those.
> 
> 
> ->d_manage()
> ------------
> 
> Can be called in RCU mode; gets an argument telling it if it has
> been called so.  Pretty much autofs-only; for everyone's sanity sake,
> don't inflict more of those on the kernel.  Definitely don't do that
> without asking first...
> 
> 
> ->permission()
> --------------
> 
> Can be called in RCU mode; that is indicated by MAY_NOT_BLOCK
> in mask, and it can only happen for MAY_EXEC checks on directories.
> In RCU mode it is not allowed to block, and it is allowed to bail out
> by returning -ECHILD.  It might be called for an inode that is getting
> torn down, possibly along with its filesystem.	Errors other than -ECHILD

s/tab/space/

> should only be returned if they would've been returned in non-RCU mode;
> several instances in procfs currently (6.5) run afoul of that one.  That's
> an instructive example, BTW - what happens is that proc_pid_permission()
> uses proc_get_task() to find the relevant process.  proc_get_task()
> uses PID reference stored in struct proc_inode our inode is embedded
> into; inode can't have been freed yet, so fetching ->pid member in that
> is safe.  However, using the value you've fetched is a different story
> - proc_evict_inode() would have passed it to put_pid() and replaced
> it with NULL.  Unsafe caller has no way to tell if that is happening
> right under it.  Solution: stop zeroing ->pid in proc_evict_inode()
> and move put_pid() from proc_pid_evict_inode() to proc_free_inode().
> That's not all that is needed (there's access to procfs-private part of
> superblock as well), but it does make a good example of how such stuff
> can be dealt with.
> 
> Note that idmap argument is safe on all calls - its destruction
> is rcu-delayed.
> 
> The amount of headache is seriously reduced (for now) by the fact
> that a lot of instances boil down to generic_permission() (which will
> do the right thing in RCU mode) when mask is MAY_EXEC | MAY_NOT_BLOCK.
> If we ever extend RCU mode to other ->permission() callers, the thing will
> get interesting; that's not likely to happen, though, unless access(2)
> goes there [this is NOT a suggestion, folks].
> 
> 
> ->get_link()
> ------------
> 
> Again, this can be called in RCU mode.	Even if your ->d_revalidate()

s/tab/space/

> always returns -ECHILD in RCU mode and kicks the pathwalk out of it,
> you can't assume that ->get_link() won't be reached [#f2]_.
> NULL dentry argument is an indicator of unsafe call; if you can't handle
> it, just return ERR_PTR(-ECHILD).  Any allocations you need to do (and
> with this method you really might need that) should be done with GFP_ATOMIC
> in the unsafe case.
> 
> Whatever you pass to set_delayed_call() is going to be called
> in the same mode as ->get_link() itself; not a problem for most of the
> instances.  The string you return needs to stay there until the
> callback gets called or, if no callback is set, until at least the
> freeing of inode.  As usual, for an unsafe call the inode might be
> in process of teardown, possibly along with the hosting filesystem.
> The usual considerations apply.  The same, BTW, applies to whatever
> you set in ->i_link - it must stay around at least until ->free_inode().
> 
> 
> ->get_inode_acl()
> -----------------
> 
> Very limited exposure for that one - unsafe call is possible
> only if you explicitly set ACL_DONT_CACHE as cached ACL value.
> Only two filesystems (fuse and overlayfs) even bother.  Unsafe call
> is indicated by explicit flag (the third argument of the method),
> bailout is done by returning ERR_PTR(-CHILD) and the usual considerations
> apply for any access to data structures you might need to do.
> 
> .. rubric:: Footnotes
> 
> .. [#f1]
> 
> Some callers prevent being called for dying dentry (holding ->d_lock and
> having verified !d_unhashed() or finding it in the list of inode's aliases
> under ->i_lock).  For those the scenario in question simply cannot arise.
> 
> Some follow the match with lockref_get_not_dead() and treat the failure
> as mismatch.  That takes care of false positives, and false negatives on
> dying dentry are still correct - we simply pretend to have lost the race.
> 
> The only caller that does not fit into the classes above is
> __d_lookup_rcu_op_compare().  There we sample ->d_seq and verify
> !d_unhashed() before calling ->d_compare().  That is not enough to
> prevent dentry from starting to die right under us; however, the sampled
> value of ->d_seq will be rechecked when the caller gets to step_into(),
> so for a false positive we will end up with a mismatch.  The corner case
> around ->d_manage() is due to the handle_mounts() done before step_into()
> gets to ->d_seq validation...
> 
> .. [#f2]
> 
> binding a symlink on top of a regular file on another filesystem is possible
> and that's all it takes for RCU pathwalk to get there.
> 

Thanks for the documentation.

-- 
#Randy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV
  2024-02-26  6:35 ` Vegard Nossum
@ 2024-02-26  9:13   ` Al Viro
  0 siblings, 0 replies; 8+ messages in thread
From: Al Viro @ 2024-02-26  9:13 UTC (permalink / raw
  To: Vegard Nossum; +Cc: linux-fsdevel, linux-doc, Linus Torvalds

On Mon, Feb 26, 2024 at 07:35:17AM +0100, Vegard Nossum wrote:

> There is a slight apparent contradiction here between "at the very
> least, they can expect [...] to remain live throughout the operation" in
> the first paragraph (which sounds like they _do_ have these guarantees)
> and most of the second paragraph (which says they _don't_ have these
> guarantees).
> 
> I *think* what you are saying is that dentries/inodes/sbs involved will
> indeed stay live (i.e. allocated), but that there are OTHER warranties
> you might usually expect that are not there, such as objects not being
> locked and potentially changing underneath your filesystem's VFS
> callback or being in a partial state or other indirectly pointed-to
> objects not being safe to access.

Live != memory object hasn't been freed yet.  It's a lot stronger than
that.  And most of the filesystem methods get those stronger warranties;
life would be very hard if we did not have those.

E.g. when you are in the middle of ->read(), you know that struct file
passed to you won't reach ->release() until after your ->read() returns,
that the filesystem it's on hasn't even started to be shut down, that
its in-core inode won't get to ->evict_inode(), that its dentry is
still associated with the same inode and will stay that way until you are
done, etc.

Normally we do get that kind of warranties - caller holds references
to the objects we are asked to operate upon.  However, the fast
path of pathname resolution (everything's in the VFS caches, no IO
or blocking operations needed, etc.) is an exception.  Several filesystem
methods (the ones involved in the fast path) may be called with
the warranties that are weaker than what they (and the rest of the
methods) normally get.  Note that e.g.  ->lookup() does not need to worry -
it's off the fast path pretty much by definition and VFS switches to
pinning objects before calling anything of that sort.

"Unsafe call" refers to the method calls made by RCU pathwalk with
weaker warranties.  Part of the objects passed to those might have
already started on the way through their destructors.

> Filesystem methods can usually count upon a number of VFS-provided
> warranties regarding the stability of the dentries/inodes/superblocks
> they are called to act upon. For example, they always can expect these
> objects to remain live throughout the operation; life would be much more
> painful without that.
> 
> However, such warranties do not come for free and other warranties may
> not always be provided. [...]
> """

Maybe...

> (As a side note, you may also want to actually link the docs we have for
> RCU lookup where you say "details are described elsewhere".)
> 
> > What methods are affected?
> > ==========================
> > 
> > 	The list of the methods that could run into that fun:
> > 
> > ========================	==================================	=================
> > 	method			indication that the call is unsafe	unstable objects
> > ========================	==================================	=================
> 
> I'd wish for explicit definitions of "unsafe" (which is a terminology
> you do use more or less consistently in this doc) and "unstable". The
> definitions don't need mathematical precision, but there should be a
> quick one-line explanation of each.

See above.

> I think "the call is unsafe" means that it doesn't have all the usual
> safety warranties (as detailed above).
> 
> I think "unstable" means "not locked, can change underneath the
> function" (but not that it can be freed), but it would be good to have
> it spelled out.

Nope.  "Locked" is not an issue.  "Might be hit by a destructor called by
another thread right under your nose" is.  It's _that_ unpleasant.  Fortunately,
most of the nastiness is on the VFS side, but there's a good reason why
quite a few filesystems simply bail out and tell VFS to piss off and not
come back without having grabbed the references, so that nothing of that
sort would have to be dealt with.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC][v2] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV
  2024-02-25 22:06 [RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV Al Viro
  2024-02-26  6:35 ` Vegard Nossum
  2024-02-26  7:58 ` Randy Dunlap
@ 2024-02-26 18:35 ` Al Viro
  2024-02-27  8:04   ` [RFC][v3] " Al Viro
  2 siblings, 1 reply; 8+ messages in thread
From: Al Viro @ 2024-02-26 18:35 UTC (permalink / raw
  To: linux-fsdevel; +Cc: linux-doc, Linus Torvalds

On Sun, Feb 25, 2024 at 05:06:40PM -0500, Al Viro wrote:
> 	The text below is a rough approximation to what should, IMO,
> end up in Documentation/filesystems/rcu-exposure.rst.  It started
> as a part of audit notes.  Intended target audience of the text is
> filesystem maintainers and/or folks who are reviewing fs code;
> I hope the current contents is already useful to an extent, but
> I'm pretty certain that it needs more massage before it could
> go into the tree.
> 
> 	Please, read and review - comments and suggestions would be
> very welcome, both for contents and for markup - I'm *not* familiar
> with ReST or the ways it's used in the kernel (in particular, footnotes
> seem to get lost in PDF).

Updated variant follows.  Changes since the previous:
* beginning has been rewritten
* typos spotted by Randy should be fixed
* dumb braino in "opt out" part fixed (->d_automount is irrelevant, it's
->d_manage one needs to watch out for)
* a stale bit in discussion of ->permission() ("currently (6.5) run afoul")
updated (to "did (prior to 6.8-rc6) run afoul").
* I gave up on ReST footnotes and did something similar manually.

The thing that worries me is the balance in the background part -
keeping it detailed enough to explain what's going on vs.
keeping it short enough so the readers wouldn't get too scared before
they get to "you might have no exposure and you might be able to
opt out" part.

Suggestions and comments would be welcome.

===================================================================
The ways in which RCU pathwalk can ruin filesystem maintainer's day
===================================================================

The problem: exposure of filesystem code to lockless environment
================================================================

Filesystem methods can usually count upon VFS-provided warranties
regarding the stability of objects they are called to act upon.

The kernel is inherently multi-threaded environment and the objects one
thread works with might be accessed by other threads to an extent that
depends upon the locking, but there is a pretty fundamental expectation:
if somebody calls your function and passes a reference to some object as
an argument, you can expect that this object will not be destroyed by
another thread right under your nose.  Arranging for that is the callers'
responsibility and the common way to provide such warranties is some form
of refcounting and/or locking.

At the very least, filesystem methods normally expect the files/dentries/inodes/superblocks
involved will live throughout the operation - their destructors will not
be called by another thread while the method is executed.

However, there is a nasty exception to that rule.  The fast path in
pathname resolution (the case when everything we need is in VFS caches)
tries very hard to operate without stores to shared data objects.

The problem is that access patterns are heavily biased; every system call
getting an absolute pathname will have to start at root directory, etc.
Having each of them in effect write "I'd been here" on the same memory
objects would cost quite a bit.

As the result, we try to keep the fast path stores-free, bumping
no refcounts and taking no locks.  Details are described elsewhere
(Documentation/filesystems/path-lookup.txt), but the bottom line for
filesystems is that the methods involved in fast path of pathname
resolution may be called with much looser warranties than usual.

And "looser" might be an understatement - we are talking about having
to work with objects that might be in process of being torn down by
another thread, possibly all the way down to the filesystem instance
that object lives on (yes, really - see [0] for details).  Of course,
from the filesystem POV every call like that is a potential source of
headache.

Such unsafe method calls do get *some* warranties.  Before going into
the details, keep in mind that

* few methods are affected and their defaults (i.e. what we get if the
  method left NULL) are safe.  If your filesystem does not need to override
  any of those defaults, you are fine (there is a minor nit regarding the
  cached fast symlinks, but that's easy to take care of).

* for most of those methods there is a way to bail out and tell VFS to
  piss off and come back when it holds proper references.  That's what
  has to happen when an instance sees that it can't go on without
  a blocking operation, but you can use the same mechanism to bail out
  as soon as you see an unsafe call.

Which methods are affected?
===========================

	The list of the methods that could run into that fun:

========================	==================================	===================	====================
	method			indication that the call is unsafe	unprotected objects	bailout return value
========================	==================================	===================	====================
->d_hash(d, ...) 		none - any call might be		d
->d_compare(d, ...)		none - any call might be		d
->d_revalidate(d, f)		f & LOOKUP_RCU				d			-ECHILD
->d_manage(d, f)		f					d			-ECHILD
->permission(i, m)		m & MAY_NOT_BLOCK			i			-ECHILD
->get_link(d, i, ...)		d == NULL				i			ERR_PTR(-ECHILD)
->get_inode_acl(i, t, f)	f == LOOKUP_RCU				i			ERR_PTR(-ECHILD)
========================	==================================	===================	====================

Additionally, callback set by set_delayed_call() from unsafe call of
->get_link() will be run in the same environment; that one is usually not
a problem, though.

For the sake of completeness, three of LSM methods
(->inode_permission(), ->inode_follow_link() and ->task_to_inode())
might be called in similar environment, but that's a problem for LSM
crowd, not for filesystem folks.

Opting out
==========

To large extent a filesystem can opt out of RCU pathwalk; that loses all
scalability benefits whenever your filesystem gets involved in pathname
resolution, though.  If that's the way you choose to go, just make sure
that

1. any non-default ->d_revalidate(), ->permission(), ->get_link() and
->get_inode_acl() instance bails out if called by RCU pathwalk (see below
for details).  Costs a couple of lines of boilerplate in each.

2. if some symlink inodes have ->i_link set to a dynamically allocated
object, that object won't be freed without an RCU delay.  Anything
coallocated with inode is fine, so's anything freed from ->free_inode().
Usually comes for free, just remember to avoid freeing directly
from ->destroy_inode().

3. any ->d_hash() and ->d_compare() instances (if you have those) do
not access any filesystem objects.

4. there's no ->d_manage() instances in your filesystem.

If your case does not fit the above, the easy opt-out is not for you.
If so, you'll have to keep reading...

What is guaranteed and what is required?
========================================

Any method call is, of course, required not to crash - no stepping on
freed memory, etc.  All of the unsafe calls listed above are done under
rcu_read_lock(), so they are not allowed to block.  Further requirements
vary between the methods.

Before going through the list of affected methods, several notes on
the things that *are* guaranteed:

* if a reference to struct dentry is passed to such call, it will
  not be freed until the method returns.  The same goes for a reference to
  struct inode and to struct super_block pointed to by ->d_sb or ->i_sb
  members of dentry and inode respectively.  Any of those might be in
  process of being torn down or enter such state right under us;
  the entire point of those unsafe calls is that we make them without
  telling anyone they'd need to wait for us.

* following ->d_parent and ->d_inode of such dentries is fine,
  provided that it's done by READ_ONCE() (for ->d_inode the preferred
  form is d_inode_rcu(dentry)).  The value of ->d_parent is never going
  to be NULL and it will again point to a struct dentry that will not be
  freed until the method call finishes.  The value of ->d_inode might
  be NULL; if non-NULL, it'll be pointing to a struct inode that will
  not be freed until the method call finishes.

* none of the inodes passed to an unsafe call could have reached
  fs/inode.c:evict() before the caller grabbed rcu_read_lock().

* for inodes 'not freed' means 'not entered ->free_inode()', so
  anything that won't be destroyed until ->free_inode() is safe to access.
  Anything synchronously destroyed in ->evict_inode() or ->destroy_inode()
  is not safe; however, one can count upon the call_rcu() callbacks
  issued in those yet to be entered.  Note that unlike dentries and
  superblocks, inodes are embedded into filesystem-private objects;
  anything stored directly in the containing object is safe to access.

* for dentries anything destroyed by ->d_prune() (synchronously or
  not) is not safe; the same goes for the things synchronously destroyed
  by ->d_release().  However, call_rcu() callbacks issued in ->d_release()
  are yet to be entered.

* for superblocks we can count upon call_rcu() callbacks issued
  from inside the ->kill_sb() (including the ones issued from
  ->put_super()) yet to be entered.  You can also count upon
  ->s_user_ns still being pinned and ->s_security still not
  freed.

* NOTE: we **can not** count upon the things like ->d_parent
  being positive (or a directory); a race with rename()+rmdir()+mknod()
  and you might find a FIFO as parent's inode.  NULL is even easier -
  just have the dentry and its ex-parent already past dentry_kill()
  (which is a normal situation for eviction on memory pressure) and there
  you go.  Normally such pathologies are prevented by the locking (and
  dentry refcounting), but... the entire point of that stuff is to avoid
  informing anyone that we are there, so those mechanisms are bypassed.
  What's more, if dentry is not pinned by refcount, grabbing its ->d_lock
  will *not* suffice to prevent that kind of mess - the scenario with
  eviction by memory pressure won't be prevented by that; you might have
  grabbed ->d_lock only after the dentry_kill() had released it, and
  at that point ->d_parent still points to what used to be the parent,
  but there's nothing to prevent its eviction.

->d_compare()
-------------

For ->d_compare() we just need to make sure it won't crash
when called for dying dentry - an incorrect return value won't harm the
caller in such case.  False positives and false negatives alike - the
callers take care of that.  To be pedantic, make that "false positives
do not cause problems unless they have ->d_manage()", but ->d_manage()
is present only on autofs and there's no autofs ->d_compare() instances.
See [1] for details, if you are curious.

There is no indication that ->d_compare() is called in RCU mode;
the majority of callers are such, anyway, so we need to cope with that.
VFS guarantees that dentry won't be freed under us; the same goes for
the superblock pointed to by its ->d_sb.  Name points to memory object
that won't get freed under us and length does not exceed the size of
that object.  The contents of that object is *NOT* guaranteed to be
stable; d_move() might race with us, modifying the name.  However, in
that case we are free to return an arbitrary result - the callers will
take care of both false positives and false negatives in such case.
The name we are comparing dentry with (passed in qstr) is stable,
thankfully...

If we need to access any other data, it's up to the filesystem
to protect it.  In practice it means that destruction of fs-private part
of superblock (and possibly unicode tables hanging off it, etc.) might
need to be RCU-delayed.

*IF* you want the behaviour that varies depending upon the parent
directory, you get to be very careful with READ_ONCE() and watch out
for the object lifetimes.

Basically, if the things get that tricky, ask for help.
Currently there are two such instances in the tree - proc_sys_compare()
and generic_ci_d_compare().  Both are... special.

->d_hash()
----------

For ->d_hash() on a dying dentry we are free to report any hash
value; the only extra requirement is that we should not return stray
hard errors.  In other words, if we return anything other than 0 or
-ECHILD, we'd better make sure that this error would've been correct
before the parent started dying.  Since ->d_hash() error reporting is
usually done to reject unacceptable names (too long, contain unsuitable
characters for this filesystem, etc.), that's really not a problem -
hard errors depend only upon the name, not the parent.

Again, VFS guarantees that freeing of dentry and of the superblock
pointed to by dentry->d_sb won't happen under us.  The name passed to
us (in qstr) is stable.  If you need anything beyond that, you are
in the same situation as with ->d_compare().  Might want to RCU-delay
freeing private part of superblock (if that's what we need to access),
might want the same for some objects hanging off that (unicode tables,
etc.).  If you need something beyond that - ask for help.

->d_revalidate()
----------------

For this one we do have an indication of call being unsafe -
flags & LOOKUP_RCU.  With ->d_revalidate we are always allowed to bail
out and return -ECHILD; that will have the caller drop out of RCU mode.
We definitely need to do that if revalidate would require any kind of IO,
mutex-taking, etc.; we can't block in RCU mode.

Quite a few instances of ->d_revalidate() simply treat LOOKUP_RCU
in flags as "return -ECHILD and be done with that"; it's guaranteed to
do the right thing, but you lose the benefits of RCU pathwalks whenever
you run into such dentry.

Same as with the previous methods, we are guaranteed that
dentry and dentry->d_sb won't be freed under us.  We are also guaranteed
that ->d_parent (which is *not* stable, so use READ_ONCE) points to a
struct dentry that won't get freed under us.  As always with ->d_parent,
it's not NULL - for a detached dentry it will point to dentry itself.
d_inode_rcu() of dentry and its parent will be either NULL or will
point to a struct inode that won't get freed under us.  Anything beyond
than that is not guaranteed.  We may find parent to be negative - it can
happen if we race with d_move() and removal of old parent.  In that case
just return -ECHILD and be done with that.

On non-RCU side you could use dget_parent() instead - that
would give a positive dentry and its ->d_inode would remain stable.
dget_parent() has to be paired with dput(), though, so it's not usable
in RCU mode.

If you need fs-private objects associated with dentry, its parent
inode(s) or superblock - see the general notes above on how to access
those.

->d_manage()
------------

Can be called in RCU mode; gets an argument telling it if it has
been called so.  Pretty much autofs-only; for everyone's sanity sake,
don't inflict more of those on the kernel.  Definitely don't do that
without asking first...

->permission()
--------------

Can be called in RCU mode; that is indicated by MAY_NOT_BLOCK
in mask, and it can only happen for MAY_EXEC checks on directories.
In RCU mode it is not allowed to block, and it is allowed to bail out
by returning -ECHILD.  It might be called for an inode that is getting
torn down, possibly along with its filesystem.  Errors other than -ECHILD
should only be returned if they would've been returned in non-RCU mode;
several instances in procfs did (prior to 6.8-rc6) run afoul of that one.
That's an instructive example, BTW - what happens is that proc_pid_permission()
uses proc_get_task() to find the relevant process.  proc_get_task()
uses PID reference stored in struct proc_inode our inode is embedded
into; inode can't have been freed yet, so fetching ->pid member in that
is safe.  However, using the value you've fetched is a different story
- proc_evict_inode() would have passed it to put_pid() and replaced
it with NULL.  Unsafe caller has no way to tell if that is happening
right under it.  Solution: stop zeroing ->pid in proc_evict_inode()
and move put_pid() from proc_pid_evict_inode() to proc_free_inode().
That's not all that is needed (there's access to procfs-private part of
superblock as well), but it does make a good example of how such stuff
can be dealt with.

Note that idmap argument is safe on all calls - its destruction
is rcu-delayed.

The amount of headache is seriously reduced (for now) by the fact
that a lot of instances boil down to generic_permission() (which will
do the right thing in RCU mode) when mask is MAY_EXEC | MAY_NOT_BLOCK.
If we ever extend RCU mode to other ->permission() callers, the thing will
get interesting; that's not likely to happen, though, unless access(2)
goes there [this is NOT a suggestion, folks].

->get_link()
------------

Again, this can be called in RCU mode.  Even if your ->d_revalidate()
always returns -ECHILD in RCU mode and kicks the pathwalk out of it,
you can't assume that ->get_link() won't be reached (see [2] for
details).

NULL dentry argument is an indicator of unsafe call; if you can't handle
it, just return ERR_PTR(-ECHILD).  Any allocations you need to do (and
with this method you really might need that) should be done with GFP_ATOMIC
in the unsafe case.

Whatever you pass to set_delayed_call() is going to be called
in the same mode as ->get_link() itself; not a problem for most of the
instances.  The string you return needs to stay there until the
callback gets called or, if no callback is set, until at least the
freeing of inode.  As usual, for an unsafe call the inode might be
in process of teardown, possibly along with the hosting filesystem.
The usual considerations apply.  The same, BTW, applies to whatever
you set in ->i_link - it must stay around at least until ->free_inode().

->get_inode_acl()
-----------------

Very limited exposure for that one - unsafe call is possible
only if you explicitly set ACL_DONT_CACHE as cached ACL value.
Only two filesystems (fuse and overlayfs) even bother.  Unsafe call
is indicated by explicit flag (the third argument of the method),
bailout is done by returning ERR_PTR(-CHILD) and the usual considerations
apply for any access to data structures you might need to do.

Footnotes
=========

[0]  The fast path of pathname resolution really can run into a dentry on
a filesystem that is getting shut down.

Here's one of the scenarios for that to happen:

	1. have two threads sharing fs_struct chdir'ed on that filesystem.
	2. lazy-umount it, so that the only thing holding it alive is
	   cwd of these threads.
	3. the first thread does relative pathname resolution
	   and gets to e.g. ->d_hash().  It's holding rcu_read_lock().
	4. at the same time the second thread does fchdir(), moving to
	   different directory.

In fchdir(2) we get to set_fs_pwd(), which set the current directory
to the new place and does mntput() on the old one.  No RCU delays here,
we calculate the refcount of that mount and see that we are dropping
the last reference.  We make sure that the pathwalk in progress in
the first thread will fail when it comes to legitimize_mnt() and do this
(in mntput_no_expire())::

	init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
	if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
		return;

As we leave the syscall, we have __cleanup_mnt() run; it calls cleanup_mnt()
on our mount, which hits deactivate_super().  That was the last reference to
superblock.

Voila - we have a filesystem shutdown right under the nose of a thread
running in ->d_hash() of something on that filesystem.  Mutatis mutandis,
one can arrange the same for other methods called by rcu pathwalk.

It's not easy to hit (especially if you want to get through the
entire ->kill_sb() before the first thread gets through ->d_hash()),
and it's probably impossible on the real hardware; on KVM it might be
borderline doable.  However, it is possible and I would not swear that
other ways of arranging the same thing are equally hard to hit.

The bottom line: methods that can be called in RCU mode need to
be careful about the per-superblock objects destruction.

[1]

Some callers prevent being called for dying dentry (holding ->d_lock and
having verified !d_unhashed() or finding it in the list of inode's aliases
under ->i_lock).  For those the scenario in question simply cannot arise.

Some follow the match with lockref_get_not_dead() and treat the failure
as mismatch.  That takes care of false positives, and false negatives on
dying dentry are still correct - we simply pretend to have lost the race.

The only caller that does not fit into the classes above is
__d_lookup_rcu_op_compare().  There we sample ->d_seq and verify
!d_unhashed() before calling ->d_compare().  That is not enough to
prevent dentry from starting to die right under us; however, the sampled
value of ->d_seq will be rechecked when the caller gets to step_into(),
so for a false positive we will end up with a mismatch.  The corner case
around ->d_manage() is due to the handle_mounts() done before step_into()
gets to ->d_seq validation...

[2]

binding a symlink on top of a regular file on another filesystem is possible
and that's all it takes for RCU pathwalk to get there.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC][v3] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV
  2024-02-26 18:35 ` [RFC][v2] " Al Viro
@ 2024-02-27  8:04   ` Al Viro
  2024-02-28  4:20     ` Akira Yokosawa
  2024-02-28 13:02     ` [RFC][v4] " Al Viro
  0 siblings, 2 replies; 8+ messages in thread
From: Al Viro @ 2024-02-27  8:04 UTC (permalink / raw
  To: linux-fsdevel; +Cc: linux-doc, Linus Torvalds

On Mon, Feb 26, 2024 at 01:35:12PM -0500, Al Viro wrote:
> On Sun, Feb 25, 2024 at 05:06:40PM -0500, Al Viro wrote:
> > 	The text below is a rough approximation to what should, IMO,
> > end up in Documentation/filesystems/rcu-exposure.rst.  It started
> > as a part of audit notes.  Intended target audience of the text is
> > filesystem maintainers and/or folks who are reviewing fs code;
> > I hope the current contents is already useful to an extent, but
> > I'm pretty certain that it needs more massage before it could
> > go into the tree.
> > 
> > 	Please, read and review - comments and suggestions would be
> > very welcome, both for contents and for markup - I'm *not* familiar
> > with ReST or the ways it's used in the kernel (in particular, footnotes
> > seem to get lost in PDF).
> 
> Updated variant follows.  Changes since the previous:
> * beginning has been rewritten
> * typos spotted by Randy should be fixed
> * dumb braino in "opt out" part fixed (->d_automount is irrelevant, it's
> ->d_manage one needs to watch out for)
> * a stale bit in discussion of ->permission() ("currently (6.5) run afoul")
> updated (to "did (prior to 6.8-rc6) run afoul").
> * I gave up on ReST footnotes and did something similar manually.

Hopefully that one is better - the introductory part should be less handwavy
and easier to follow now...

===================================================================
The ways in which RCU pathwalk can ruin filesystem maintainer's day
===================================================================

The problem: exposure of filesystem code to lockless environment
================================================================

Filesystem methods can usually count upon VFS-provided warranties
regarding the stability of objects they are called to act upon.

The kernel is an inherently multi-threaded environment and the objects one
thread works with might be accessed by other threads to an extent that
depends upon the locking, but there is a pretty fundamental expectation:
if somebody calls your function and passes a reference to some object as
an argument, you can expect that this object will not be destroyed by
another thread right under your nose.  Arranging for that is the callers'
responsibility and the common way to provide such warranties is some form
of refcounting and/or locking.

At the very least, filesystem methods normally can expect that
files/dentries/inodes/superblocks involved will live throughout the
operation - their destructors will not be called by another thread while
the method is executed.

However, there is one important case where providing such warranties
is rather costly.  On the fast path in pathname resolution (the case
when everything we need is in VFS caches) grabbing and dropping dentry
references on each object we pass through ends up with unpleasant
scalability issues.

The problem is that access patterns are heavily biased; every system call
getting an absolute pathname will have to start at root directory, etc.
Having each of them in effect write "I'd been here" on the same memory
objects would cost quite a bit.

To deal with that we try to keep the fast path stores-free, bumping
no refcounts and taking no locks.  Details are described elsewhere
(Documentation/filesystems/path-lookup.txt), but the bottom line for
filesystems is that the methods involved in fast path of pathname
resolution may be called with much looser warranties than usual - the
caller deliberately does *not* take steps that would normally protect
the object from being torn down by another thread while the method is
trying to work with it.  That applies not just to dentries - associated
inodes and even the filesystem instance the object belongs to could be
in process of getting torn down (yes, really - see [0] for details).
Of course, from the filesystem POV every call like that is a potential
source of headache.

Such unsafe method calls do get *some* warranties.  Before going into
the details, keep in mind that

* few methods are affected and their defaults (i.e. what we get if the
  method is left NULL) are safe.  If your filesystem does not need to
  override any of those defaults, you are fine (there is a minor nit
  regarding the cached fast symlinks, but that's easy to take care of).

* for most of those methods there is a way to bail out and tell VFS to
  leave the fast path and switch to holding proper references from
  that point on.  That's what has to happen when an instance sees that it
  can't go on without a blocking operation, but you can use the same
  mechanism to bail out as soon as you see an unsafe call.

Which methods are affected?
===========================

	The list of the methods that could run into that fun:

========================	==================================	===================	====================
	method			indication that the call is unsafe	unprotected objects	bailout return value
========================	==================================	===================	====================
->d_hash(d, ...) 		none - any call might be		d
->d_compare(d, ...)		none - any call might be		d
->d_revalidate(d, f)		f & LOOKUP_RCU				d			-ECHILD
->d_manage(d, f)		f					d			-ECHILD
->permission(i, m)		m & MAY_NOT_BLOCK			i			-ECHILD
->get_link(d, i, ...)		d == NULL				i			ERR_PTR(-ECHILD)
->get_inode_acl(i, t, f)	f == LOOKUP_RCU				i			ERR_PTR(-ECHILD)
========================	==================================	===================	====================

Additionally, callback set by set_delayed_call() from unsafe call of
->get_link() will be run in the same environment; that one is usually not
a problem, though.

For the sake of completeness, three of LSM methods
(->inode_permission(), ->inode_follow_link() and ->task_to_inode())
might be called in similar environment, but that's a problem for LSM
crowd, not for filesystem folks.

Opting out
==========

To large extent a filesystem can opt out of RCU pathwalk; that loses all
scalability benefits whenever your filesystem gets involved in pathname
resolution, though.  If that's the way you choose to go, just make sure
that

1. any non-default ->d_revalidate(), ->permission(), ->get_link() and
->get_inode_acl() instance bails out if called by RCU pathwalk (see below
for details).  Costs a couple of lines of boilerplate in each.

2. if some symlink inodes have ->i_link set to a dynamically allocated
object, that object won't be freed without an RCU delay.  Anything
coallocated with inode is fine, so's anything freed from ->free_inode().
Usually comes for free, just remember to avoid freeing directly
from ->destroy_inode().

3. any ->d_hash() and ->d_compare() instances (if you have those) do
not access any filesystem objects.

4. there's no ->d_manage() instances in your filesystem.

If your case does not fit the above, the easy opt-out is not for you.
If so, you'll have to keep reading...

What is guaranteed and what is required?
========================================

Any method call is, of course, required not to crash - no stepping on
freed memory, etc.  All of the unsafe calls listed above are done under
rcu_read_lock(), so they are not allowed to block.  Further requirements
vary between the methods.

Before going through the list of affected methods, several notes on
the things that *are* guaranteed:

* if a reference to struct dentry is passed to such call, it will
  not be freed until the method returns.  The same goes for a reference to
  struct inode and to struct super_block pointed to by ->d_sb or ->i_sb
  members of dentry and inode respectively.  Any of those might be in
  process of being torn down or enter such state right under us;
  the entire point of those unsafe calls is that we make them without
  telling anyone they'd need to wait for us.

* following ->d_parent and ->d_inode of such dentries is fine,
  provided that it's done by READ_ONCE() (for ->d_inode the preferred
  form is d_inode_rcu(dentry)).  The value of ->d_parent is never going
  to be NULL and it will again point to a struct dentry that will not be
  freed until the method call finishes.  The value of ->d_inode might
  be NULL; if non-NULL, it'll be pointing to a struct inode that will
  not be freed until the method call finishes.

* none of the inodes passed to an unsafe call could have reached
  fs/inode.c:evict() before the caller grabbed rcu_read_lock().

* for inodes 'not freed' means 'not entered ->free_inode()', so
  anything that won't be destroyed until ->free_inode() is safe to access.
  Anything synchronously destroyed in ->evict_inode() or ->destroy_inode()
  is not safe; however, one can count upon the call_rcu() callbacks
  issued in those yet to be entered.  Note that unlike dentries and
  superblocks, inodes are embedded into filesystem-private objects;
  anything stored directly in the containing object is safe to access.

* for dentries anything destroyed by ->d_prune() (synchronously or
  not) is not safe; the same goes for the things synchronously destroyed
  by ->d_release().  However, call_rcu() callbacks issued in ->d_release()
  are yet to be entered.

* for superblocks we can count upon call_rcu() callbacks issued
  from inside the ->kill_sb() (including the ones issued from
  ->put_super()) yet to be entered.  You can also count upon
  ->s_user_ns still being pinned and ->s_security still not
  freed.

* NOTE: we **can not** count upon the things like ->d_parent
  being positive (or a directory); a race with rename()+rmdir()+mknod()
  and you might find a FIFO as parent's inode.  NULL is even easier -
  just have the dentry and its ex-parent already past dentry_kill()
  (which is a normal situation for eviction on memory pressure) and there
  you go.  Normally such pathologies are prevented by the locking (and
  dentry refcounting), but... the entire point of that stuff is to avoid
  informing anyone that we are there, so those mechanisms are bypassed.
  What's more, if dentry is not pinned by refcount, grabbing its ->d_lock
  will *not* suffice to prevent that kind of mess - the scenario with
  eviction by memory pressure won't be prevented by that; you might have
  grabbed ->d_lock only after the dentry_kill() had released it, and
  at that point ->d_parent still points to what used to be the parent,
  but there's nothing to prevent its eviction.

->d_compare()
-------------

For ->d_compare() we just need to make sure it won't crash
when called for dying dentry - an incorrect return value won't harm the
caller in such case.  False positives and false negatives alike - the
callers take care of that.  To be pedantic, make that "false positives
do not cause problems unless they have ->d_manage()", but ->d_manage()
is present only on autofs and there's no autofs ->d_compare() instances.
See [1] for details, if you are curious.

There is no indication that ->d_compare() is called in RCU mode;
the majority of callers are such, anyway, so we need to cope with that.
VFS guarantees that dentry won't be freed under us; the same goes for
the superblock pointed to by its ->d_sb.  Name points to memory object
that won't get freed under us and length does not exceed the size of
that object.  The contents of that object is *NOT* guaranteed to be
stable; d_move() might race with us, modifying the name.  However, in
that case we are free to return an arbitrary result - the callers will
take care of both false positives and false negatives in such case.
The name we are comparing dentry with (passed in qstr) is stable,
thankfully...

If we need to access any other data, it's up to the filesystem
to protect it.  In practice it means that destruction of fs-private part
of superblock (and possibly unicode tables hanging off it, etc.) might
need to be RCU-delayed.

*IF* you want the behaviour that varies depending upon the parent
directory, you get to be very careful with READ_ONCE() and watch out
for the object lifetimes.

Basically, if the things get that tricky, ask for help.
Currently there are two such instances in the tree - proc_sys_compare()
and generic_ci_d_compare().  Both are... special.

->d_hash()
----------

For ->d_hash() on a dying dentry we are free to report any hash
value; the only extra requirement is that we should not return stray
hard errors.  In other words, if we return anything other than 0 or
-ECHILD, we'd better make sure that this error would've been correct
before the parent started dying.  Since ->d_hash() error reporting is
usually done to reject unacceptable names (too long, contain unsuitable
characters for this filesystem, etc.), that's really not a problem -
hard errors depend only upon the name, not the parent.

Again, VFS guarantees that freeing of dentry and of the superblock
pointed to by dentry->d_sb won't happen under us.  The name passed to
us (in qstr) is stable.  If you need anything beyond that, you are
in the same situation as with ->d_compare().  Might want to RCU-delay
freeing private part of superblock (if that's what we need to access),
might want the same for some objects hanging off that (unicode tables,
etc.).  If you need something beyond that - ask for help.

->d_revalidate()
----------------

For this one we do have an indication of call being unsafe -
flags & LOOKUP_RCU.  With ->d_revalidate we are always allowed to bail
out and return -ECHILD; that will have the caller drop out of RCU mode.
We definitely need to do that if revalidate would require any kind of IO,
mutex-taking, etc.; we can't block in RCU mode.

Quite a few instances of ->d_revalidate() simply treat LOOKUP_RCU
in flags as "return -ECHILD and be done with that"; it's guaranteed to
do the right thing, but you lose the benefits of RCU pathwalks whenever
you run into such dentry.

Same as with the previous methods, we are guaranteed that
dentry and dentry->d_sb won't be freed under us.  We are also guaranteed
that ->d_parent (which is *not* stable, so use READ_ONCE) points to a
struct dentry that won't get freed under us.  As always with ->d_parent,
it's not NULL - for a detached dentry it will point to dentry itself.
d_inode_rcu() of dentry and its parent will be either NULL or will
point to a struct inode that won't get freed under us.  Anything beyond
than that is not guaranteed.  We may find parent to be negative - it can
happen if we race with d_move() and removal of old parent.  In that case
just return -ECHILD and be done with that.

On non-RCU side you could use dget_parent() instead - that
would give a positive dentry and its ->d_inode would remain stable.
dget_parent() has to be paired with dput(), though, so it's not usable
in RCU mode.

If you need fs-private objects associated with dentry, its parent
inode(s) or superblock - see the general notes above on how to access
those.

->d_manage()
------------

Can be called in RCU mode; gets an argument telling it if it has
been called so.  Pretty much autofs-only; for everyone's sanity sake,
don't inflict more of those on the kernel.  Definitely don't do that
without asking first...

->permission()
--------------

Can be called in RCU mode; that is indicated by MAY_NOT_BLOCK
in mask, and it can only happen for MAY_EXEC checks on directories.
In RCU mode it is not allowed to block, and it is allowed to bail out
by returning -ECHILD.  It might be called for an inode that is getting
torn down, possibly along with its filesystem.  Errors other than -ECHILD
should only be returned if they would've been returned in non-RCU mode;
several instances in procfs did (prior to 6.8-rc6) run afoul of that one.
That's an instructive example, BTW - what happens is that proc_pid_permission()
uses proc_get_task() to find the relevant process.  proc_get_task()
uses PID reference stored in struct proc_inode our inode is embedded
into; inode can't have been freed yet, so fetching ->pid member in that
is safe.  However, using the value you've fetched is a different story
- proc_evict_inode() would have passed it to put_pid() and replaced
it with NULL.  Unsafe caller has no way to tell if that is happening
right under it.  Solution: stop zeroing ->pid in proc_evict_inode()
and move put_pid() from proc_pid_evict_inode() to proc_free_inode().
That's not all that is needed (there's access to procfs-private part of
superblock as well), but it does make a good example of how such stuff
can be dealt with.

Note that idmap argument is safe on all calls - its destruction
is rcu-delayed.

The amount of headache is seriously reduced (for now) by the fact
that a lot of instances boil down to generic_permission() (which will
do the right thing in RCU mode) when mask is MAY_EXEC | MAY_NOT_BLOCK.
If we ever extend RCU mode to other ->permission() callers, the thing will
get interesting; that's not likely to happen, though, unless access(2)
goes there [this is NOT a suggestion, folks].

->get_link()
------------

Again, this can be called in RCU mode.  Even if your ->d_revalidate()
always returns -ECHILD in RCU mode and kicks the pathwalk out of it,
you can't assume that ->get_link() won't be reached (see [2] for
details).

NULL dentry argument is an indicator of unsafe call; if you can't handle
it, just return ERR_PTR(-ECHILD).  Any allocations you need to do (and
with this method you really might need that) should be done with GFP_ATOMIC
in the unsafe case.

Whatever you pass to set_delayed_call() is going to be called
in the same mode as ->get_link() itself; not a problem for most of the
instances.  The string you return needs to stay there until the
callback gets called or, if no callback is set, until at least the
freeing of inode.  As usual, for an unsafe call the inode might be
in process of teardown, possibly along with the hosting filesystem.
The usual considerations apply.  The same, BTW, applies to whatever
you set in ->i_link - it must stay around at least until ->free_inode().

->get_inode_acl()
-----------------

Very limited exposure for that one - unsafe call is possible
only if you explicitly set ACL_DONT_CACHE as cached ACL value.
Only two filesystems (fuse and overlayfs) even bother.  Unsafe call
is indicated by explicit flag (the third argument of the method),
bailout is done by returning ERR_PTR(-CHILD) and the usual considerations
apply for any access to data structures you might need to do.

Footnotes
=========

[0]  The fast path of pathname resolution really can run into a dentry on
a filesystem that is getting shut down.

Here's one of the scenarios for that to happen:

	1. have two threads sharing fs_struct chdir'ed on that filesystem.
	2. lazy-umount it, so that the only thing holding it alive is
	   cwd of these threads.
	3. the first thread does relative pathname resolution
	   and gets to e.g. ->d_hash().  It's holding rcu_read_lock().
	4. at the same time the second thread does fchdir(), moving to
	   different directory.

In fchdir(2) we get to set_fs_pwd(), which set the current directory
to the new place and does mntput() on the old one.  No RCU delays here,
we calculate the refcount of that mount and see that we are dropping
the last reference.  We make sure that the pathwalk in progress in
the first thread will fail when it comes to legitimize_mnt() and do this
(in mntput_no_expire())::

	init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
	if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
		return;

As we leave the syscall, we have __cleanup_mnt() run; it calls cleanup_mnt()
on our mount, which hits deactivate_super().  That was the last reference to
superblock.

Voila - we have a filesystem shutdown right under the nose of a thread
running in ->d_hash() of something on that filesystem.  Mutatis mutandis,
one can arrange the same for other methods called by rcu pathwalk.

It's not easy to hit (especially if you want to get through the
entire ->kill_sb() before the first thread gets through ->d_hash()),
and it's probably impossible on the real hardware; on KVM it might be
borderline doable.  However, it is possible and I would not swear that
other ways of arranging the same thing are equally hard to hit.

The bottom line: methods that can be called in RCU mode need to
be careful about the per-superblock objects destruction.

[1]

Some callers prevent being called for dying dentry (holding ->d_lock and
having verified !d_unhashed() or finding it in the list of inode's aliases
under ->i_lock).  For those the scenario in question simply cannot arise.

Some follow the match with lockref_get_not_dead() and treat the failure
as mismatch.  That takes care of false positives, and false negatives on
dying dentry are still correct - we simply pretend to have lost the race.

The only caller that does not fit into the classes above is
__d_lookup_rcu_op_compare().  There we sample ->d_seq and verify
!d_unhashed() before calling ->d_compare().  That is not enough to
prevent dentry from starting to die right under us; however, the sampled
value of ->d_seq will be rechecked when the caller gets to step_into(),
so for a false positive we will end up with a mismatch.  The corner case
around ->d_manage() is due to the handle_mounts() done before step_into()
gets to ->d_seq validation...

[2]

binding a symlink on top of a regular file on another filesystem is possible
and that's all it takes for RCU pathwalk to get there.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC][v3] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV
  2024-02-27  8:04   ` [RFC][v3] " Al Viro
@ 2024-02-28  4:20     ` Akira Yokosawa
  2024-02-28 13:02     ` [RFC][v4] " Al Viro
  1 sibling, 0 replies; 8+ messages in thread
From: Akira Yokosawa @ 2024-02-28  4:20 UTC (permalink / raw
  To: viro; +Cc: linux-doc, linux-fsdevel, torvalds, Akira Yokosawa

Hello Al,

I'm not a VFS/FS person, but just wanted to suggest proper formatting
of footnotes in ReST.

On Tue, 27 Feb 2024 03:04:59 -0500, Al Viro wrote:
...
> On Mon, Feb 26, 2024 at 01:35:12PM -0500, Al Viro wrote:
>> Updated variant follows.  Changes since the previous:
>> * beginning has been rewritten
>> * typos spotted by Randy should be fixed
>> * dumb braino in "opt out" part fixed (->d_automount is irrelevant, it's
>> ->d_manage one needs to watch out for)
>> * a stale bit in discussion of ->permission() ("currently (6.5) run afoul")
>> updated (to "did (prior to 6.8-rc6) run afoul").
>> * I gave up on ReST footnotes and did something similar manually.

But first, one of enumerated lists here:

> Opting out
> ==========
> 
> To large extent a filesystem can opt out of RCU pathwalk; that loses all
> scalability benefits whenever your filesystem gets involved in pathname
> resolution, though.  If that's the way you choose to go, just make sure
> that
> 
> 1. any non-default ->d_revalidate(), ->permission(), ->get_link() and
> ->get_inode_acl() instance bails out if called by RCU pathwalk (see below
> for details).  Costs a couple of lines of boilerplate in each.
> 
> 2. if some symlink inodes have ->i_link set to a dynamically allocated
> object, that object won't be freed without an RCU delay.  Anything
> coallocated with inode is fine, so's anything freed from ->free_inode().
> Usually comes for free, just remember to avoid freeing directly
> from ->destroy_inode().
> 
> 3. any ->d_hash() and ->d_compare() instances (if you have those) do
> not access any filesystem objects.
> 
> 4. there's no ->d_manage() instances in your filesystem.
>
> If your case does not fit the above, the easy opt-out is not for you.
> If so, you'll have to keep reading...

is not recognized by Sphinx due to missing indents.

You need to indent each item as shown bellow:

----------8-<---------------
Opting out
==========

To large extent a filesystem can opt out of RCU pathwalk; that loses all
scalability benefits whenever your filesystem gets involved in pathname
resolution, though.  If that's the way you choose to go, just make sure
that

1. any non-default ->d_revalidate(), ->permission(), ->get_link() and
   ->get_inode_acl() instance bails out if called by RCU pathwalk (see below
   for details).  Costs a couple of lines of boilerplate in each.

2. if some symlink inodes have ->i_link set to a dynamically allocated
   object, that object won't be freed without an RCU delay.  Anything
   coallocated with inode is fine, so's anything freed from ->free_inode().
   Usually comes for free, just remember to avoid freeing directly
   from ->destroy_inode().

3. any ->d_hash() and ->d_compare() instances (if you have those) do
   not access any filesystem objects.

4. there's no ->d_manage() instances in your filesystem.

If your case does not fit the above, the easy opt-out is not for you.
If so, you'll have to keep reading...
----------8-<---------------

Missing indents is also the reason why your footnotes didn't work as you
expected.

Diff below partially reverts your v2 changes WRT footnotes and does
proper indents in footnote contents area.

----------8-<---------------
--- a/Documentation/filesystems/rcu-exposure.rst
+++ b/Documentation/filesystems/rcu-exposure.rst
@@ -42,7 +42,7 @@ caller deliberately does *not* take steps that would normally protect
 the object from being torn down by another thread while the method is
 trying to work with it.  That applies not just to dentries - associated
 inodes and even the filesystem instance the object belongs to could be
-in process of getting torn down (yes, really - see [0] for details).
+in process of getting torn down (yes, really).\ [#f0]_
 Of course, from the filesystem POV every call like that is a potential
 source of headache.
 
@@ -188,8 +188,8 @@ when called for dying dentry - an incorrect return value won't harm the
 caller in such case.  False positives and false negatives alike - the
 callers take care of that.  To be pedantic, make that "false positives
 do not cause problems unless they have ->d_manage()", but ->d_manage()
-is present only on autofs and there's no autofs ->d_compare() instances.
-See [1] for details, if you are curious.
+is present only on autofs and there's no autofs ->d_compare()
+instances.\ [#f1]_
 
 There is no indication that ->d_compare() is called in RCU mode;
 the majority of callers are such, anyway, so we need to cope with that.
@@ -321,8 +321,7 @@ goes there [this is NOT a suggestion, folks].
 
 Again, this can be called in RCU mode.  Even if your ->d_revalidate()
 always returns -ECHILD in RCU mode and kicks the pathwalk out of it,
-you can't assume that ->get_link() won't be reached (see [2] for
-details).
+you can't assume that ->get_link() won't be reached.\ [#f2]_
 
 NULL dentry argument is an indicator of unsafe call; if you can't handle
 it, just return ERR_PTR(-ECHILD).  Any allocations you need to do (and
@@ -350,13 +349,14 @@ bailout is done by returning ERR_PTR(-CHILD) and the usual considerations
 apply for any access to data structures you might need to do.
 
 
-Footnotes
-=========
+.. rubric:: Footnotes
 
-[0]  The fast path of pathname resolution really can run into a dentry on
-a filesystem that is getting shut down.
+.. [#f0]
 
-Here's one of the scenarios for that to happen:
+  The fast path of pathname resolution really can run into a dentry on
+  a filesystem that is getting shut down.
+
+  Here's one of the scenarios for that to happen:
 
 	1. have two threads sharing fs_struct chdir'ed on that filesystem.
 	2. lazy-umount it, so that the only thing holding it alive is
@@ -366,54 +366,54 @@ Here's one of the scenarios for that to happen:
 	4. at the same time the second thread does fchdir(), moving to
 	   different directory.
 
-In fchdir(2) we get to set_fs_pwd(), which set the current directory
-to the new place and does mntput() on the old one.  No RCU delays here,
-we calculate the refcount of that mount and see that we are dropping
-the last reference.  We make sure that the pathwalk in progress in
-the first thread will fail when it comes to legitimize_mnt() and do this
-(in mntput_no_expire())::
+  In fchdir(2) we get to set_fs_pwd(), which set the current directory
+  to the new place and does mntput() on the old one.  No RCU delays here,
+  we calculate the refcount of that mount and see that we are dropping
+  the last reference.  We make sure that the pathwalk in progress in
+  the first thread will fail when it comes to legitimize_mnt() and do this
+  (in mntput_no_expire())::
 
 	init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
 	if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
 		return;
 
-As we leave the syscall, we have __cleanup_mnt() run; it calls cleanup_mnt()
-on our mount, which hits deactivate_super().  That was the last reference to
-superblock.
+  As we leave the syscall, we have __cleanup_mnt() run; it calls cleanup_mnt()
+  on our mount, which hits deactivate_super().  That was the last reference to
+  superblock.
 
-Voila - we have a filesystem shutdown right under the nose of a thread
-running in ->d_hash() of something on that filesystem.  Mutatis mutandis,
-one can arrange the same for other methods called by rcu pathwalk.
+  Voila - we have a filesystem shutdown right under the nose of a thread
+  running in ->d_hash() of something on that filesystem.  Mutatis mutandis,
+  one can arrange the same for other methods called by rcu pathwalk.
 
-It's not easy to hit (especially if you want to get through the
-entire ->kill_sb() before the first thread gets through ->d_hash()),
-and it's probably impossible on the real hardware; on KVM it might be
-borderline doable.  However, it is possible and I would not swear that
-other ways of arranging the same thing are equally hard to hit.
+  It's not easy to hit (especially if you want to get through the
+  entire ->kill_sb() before the first thread gets through ->d_hash()),
+  and it's probably impossible on the real hardware; on KVM it might be
+  borderline doable.  However, it is possible and I would not swear that
+  other ways of arranging the same thing are equally hard to hit.
 
-The bottom line: methods that can be called in RCU mode need to
-be careful about the per-superblock objects destruction.
+  The bottom line: methods that can be called in RCU mode need to
+  be careful about the per-superblock objects destruction.
 
-[1]
+.. [#f1]
 
-Some callers prevent being called for dying dentry (holding ->d_lock and
-having verified !d_unhashed() or finding it in the list of inode's aliases
-under ->i_lock).  For those the scenario in question simply cannot arise.
+  Some callers prevent being called for dying dentry (holding ->d_lock and
+  having verified !d_unhashed() or finding it in the list of inode's aliases
+  under ->i_lock).  For those the scenario in question simply cannot arise.
 
-Some follow the match with lockref_get_not_dead() and treat the failure
-as mismatch.  That takes care of false positives, and false negatives on
-dying dentry are still correct - we simply pretend to have lost the race.
+  Some follow the match with lockref_get_not_dead() and treat the failure
+  as mismatch.  That takes care of false positives, and false negatives on
+  dying dentry are still correct - we simply pretend to have lost the race.
 
-The only caller that does not fit into the classes above is
-__d_lookup_rcu_op_compare().  There we sample ->d_seq and verify
-!d_unhashed() before calling ->d_compare().  That is not enough to
-prevent dentry from starting to die right under us; however, the sampled
-value of ->d_seq will be rechecked when the caller gets to step_into(),
-so for a false positive we will end up with a mismatch.  The corner case
-around ->d_manage() is due to the handle_mounts() done before step_into()
-gets to ->d_seq validation...
+  The only caller that does not fit into the classes above is
+  __d_lookup_rcu_op_compare().  There we sample ->d_seq and verify
+  !d_unhashed() before calling ->d_compare().  That is not enough to
+  prevent dentry from starting to die right under us; however, the sampled
+  value of ->d_seq will be rechecked when the caller gets to step_into(),
+  so for a false positive we will end up with a mismatch.  The corner case
+  around ->d_manage() is due to the handle_mounts() done before step_into()
+  gets to ->d_seq validation...
 
-[2]
+.. [#f2]
 
-binding a symlink on top of a regular file on another filesystem is possible
-and that's all it takes for RCU pathwalk to get there.
+  binding a symlink on top of a regular file on another filesystem is possible
+  and that's all it takes for RCU pathwalk to get there.
----------8-<---------------

This should generate footnotes both in PDF and HTML outputs.

Note that the pattern "foo bar.\ [#fn]_" in the referencing sites is to
place the footnote markers just next to the punctuation. In PDF, it should
prevent line breaks in front of the markers.  HTML doesn't behave that
way, it seems.

HTH, Akira


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC][v4] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV
  2024-02-27  8:04   ` [RFC][v3] " Al Viro
  2024-02-28  4:20     ` Akira Yokosawa
@ 2024-02-28 13:02     ` Al Viro
  1 sibling, 0 replies; 8+ messages in thread
From: Al Viro @ 2024-02-28 13:02 UTC (permalink / raw
  To: linux-fsdevel; +Cc: linux-doc, Linus Torvalds

On Tue, Feb 27, 2024 at 03:04:59AM -0500, Al Viro wrote:
> On Mon, Feb 26, 2024 at 01:35:12PM -0500, Al Viro wrote:
> > On Sun, Feb 25, 2024 at 05:06:40PM -0500, Al Viro wrote:
> > > 	The text below is a rough approximation to what should, IMO,
> > > end up in Documentation/filesystems/rcu-exposure.rst.  It started
> > > as a part of audit notes.  Intended target audience of the text is
> > > filesystem maintainers and/or folks who are reviewing fs code;
> > > I hope the current contents is already useful to an extent, but
> > > I'm pretty certain that it needs more massage before it could
> > > go into the tree.
> > > 
> > > 	Please, read and review - comments and suggestions would be
> > > very welcome, both for contents and for markup - I'm *not* familiar
> > > with ReST or the ways it's used in the kernel (in particular, footnotes
> > > seem to get lost in PDF).
> > 
> > Updated variant follows.  Changes since the previous:
> > * beginning has been rewritten
> > * typos spotted by Randy should be fixed
> > * dumb braino in "opt out" part fixed (->d_automount is irrelevant, it's
> > ->d_manage one needs to watch out for)
> > * a stale bit in discussion of ->permission() ("currently (6.5) run afoul")
> > updated (to "did (prior to 6.8-rc6) run afoul").
> > * I gave up on ReST footnotes and did something similar manually.
> 
> Hopefully that one is better - the introductory part should be less handwavy
> and easier to follow now...

Changes since v3:
* used the suggestion by Akira Yokosawa re fixing the footnotes
* another rewrite of introductory part.

Please, review.

===================================================================
The ways in which RCU pathwalk can ruin filesystem maintainer's day
===================================================================

Pathname resolution fast path: the scarier conditions for filesystem code
=========================================================================

The basic safety assumption for any multithreaded OO code is that
objects passed to a function will not have been already destroyed by
another thread.  It may guaranteed by various mechanisms (refcounting,
some form of exclusion, not having pointers to the object visible to other
threads, etc.), but no matter how it's done, the caller is expected to
have done that.

Usually one wants a stronger warranty - that destructors of any objects
passed to a function would not be called by other threads until the
function has either done something that explicitly allows that to happen
(e.g. dropped a reference to refcounted object that had been passed
to it, stored a pointer in a shared data structure, etc.) or finished
its execution.

Filesystem methods **usually** can count upon such warranties regarding
the lifetime of objects they are called to act upon.

However, there is one important case where providing such warranties
is rather costly.  On the fast path in pathname resolution (the case
when everything we need is in VFS caches), grabbing and dropping dentry
references for every visited directory would cause unpleasant scalability
issues.

The problem is that access patterns are heavily biased; every system call
getting an absolute pathname will have to start at root directory, etc.
Having each of them in effect write "I'd been here" on the same memory
objects would cost quite a bit.

To deal with that we try to keep the fast path stores-free, bumping
no refcounts and taking no locks.  Details are described elsewhere
(Documentation/filesystems/path-lookup.txt), but the bottom line for
filesystems is that the methods involved in fast path of pathname
resolution may be called with much looser warranties than usual.
The caller deliberately does *not* take the steps that would normally
protect the object from being torn down by another thread while the
method is trying to work with it.  This applies not just to dentries -
associated inodes and even the filesystem instance the object belongs
to could be in process of getting torn down (yes, really).\ [#f0]_
Of course, from the filesystem point of view, every call like that is
a potential source of headache.

Such unsafe method calls do get *some* warranties.  Before going into
the details, keep in mind that

* few methods are affected, and their defaults (i.e. what we get if the
  method is left NULL) are safe.  If your filesystem does not need to
  override any of those defaults, you are fine (there is a minor nit
  regarding the cached fast symlinks, but that's easy to take care of).

* for most of those methods there is a way to bail out and tell VFS to
  leave the fast path and switch to holding proper references from
  that point on.  That's what has to happen when an instance sees that it
  can't go on without a blocking operation, but you can use the same
  mechanism to bail out as soon as you see an unsafe call.

Which methods are affected?
===========================

	The list of the methods that could run into that fun:

========================	==================================	===================	====================
	method			indication that the call is unsafe	unprotected objects	bailout return value
========================	==================================	===================	====================
->d_hash(d, ...) 		none - any call might be		d
->d_compare(d, ...)		none - any call might be		d
->d_revalidate(d, f)		f & LOOKUP_RCU				d			-ECHILD
->d_manage(d, f)		f					d			-ECHILD
->permission(i, m)		m & MAY_NOT_BLOCK			i			-ECHILD
->get_link(d, i, ...)		d == NULL				i			ERR_PTR(-ECHILD)
->get_inode_acl(i, t, f)	f == LOOKUP_RCU				i			ERR_PTR(-ECHILD)
========================	==================================	===================	====================

Additionally, callback set by set_delayed_call() from unsafe call of
->get_link() will be run in the same environment; that one is usually not
a problem, though.

For the sake of completeness, three of LSM methods
(->inode_permission(), ->inode_follow_link() and ->task_to_inode())
might be called in similar environment, but that's a problem for LSM
crowd, not for filesystem folks.

Opting out
==========

To large extent a filesystem can opt out of RCU pathwalk; that loses all
scalability benefits whenever your filesystem gets involved in pathname
resolution, though.  If that's the way you choose to go, just make sure
that

1. any non-default ->d_revalidate(), ->permission(), ->get_link() and
   ->get_inode_acl() instance bails out if called by RCU pathwalk (see
   below for details).  Costs a couple of lines of boilerplate in each.

2. if some symlink inodes have ->i_link set to a dynamically allocated
   object, that object won't be freed without an RCU delay.  Anything
   coallocated with inode is fine, so's anything freed from ->free_inode().
   Usually comes for free, just remember to avoid freeing directly
   from ->destroy_inode().

3. any ->d_hash() and ->d_compare() instances (if you have those) do
   not access any filesystem objects.

4. there's no ->d_manage() instances in your filesystem.

If your case does not fit the above, the easy opt-out is not for you.
If so, you'll have to keep reading...

What is guaranteed and what is required?
========================================

Any method call is, of course, required not to crash - no stepping on
freed memory, etc.  All of the unsafe calls listed above are done under
rcu_read_lock(), so they are not allowed to block.  Further requirements
vary between the methods.

Before going through the list of affected methods, several notes on
the things that *are* guaranteed:

* if a reference to struct dentry is passed to such call, it will
  not be freed until the method returns.  The same goes for a reference to
  struct inode and to struct super_block pointed to by ->d_sb or ->i_sb
  members of dentry and inode respectively.  Any of those might be in
  process of being torn down or enter such state right under us;
  the entire point of those unsafe calls is that we make them without
  telling anyone they'd need to wait for us.

* following ->d_parent and ->d_inode of such dentries is fine,
  provided that it's done by READ_ONCE() (for ->d_inode the preferred
  form is d_inode_rcu(dentry)).  The value of ->d_parent is never going
  to be NULL and it will again point to a struct dentry that will not be
  freed until the method call finishes.  The value of ->d_inode might
  be NULL; if non-NULL, it'll be pointing to a struct inode that will
  not be freed until the method call finishes.

* none of the inodes passed to an unsafe call could have reached
  fs/inode.c:evict() before the caller grabbed rcu_read_lock().

* for inodes 'not freed' means 'not entered ->free_inode()', so
  anything that won't be destroyed until ->free_inode() is safe to access.
  Anything synchronously destroyed in ->evict_inode() or ->destroy_inode()
  is not safe; however, one can count upon the call_rcu() callbacks
  issued in those yet to be entered.  Note that unlike dentries and
  superblocks, inodes are embedded into filesystem-private objects;
  anything stored directly in the containing object is safe to access.

* for dentries anything destroyed by ->d_prune() (synchronously or
  not) is not safe; the same goes for the things synchronously destroyed
  by ->d_release().  However, call_rcu() callbacks issued in ->d_release()
  are yet to be entered.

* for superblocks we can count upon call_rcu() callbacks issued
  from inside the ->kill_sb() (including the ones issued from
  ->put_super()) yet to be entered.  You can also count upon
  ->s_user_ns still being pinned and ->s_security still not
  freed.

* NOTE: we **can not** count upon the things like ->d_parent
  being positive (or a directory); a race with rename()+rmdir()+mknod()
  and you might find a FIFO as parent's inode.  NULL is even easier -
  just have the dentry and its ex-parent already past dentry_kill()
  (which is a normal situation for eviction on memory pressure) and there
  you go.  Normally such pathologies are prevented by the locking (and
  dentry refcounting), but... the entire point of that stuff is to avoid
  informing anyone that we are there, so those mechanisms are bypassed.
  What's more, if dentry is not pinned by refcount, grabbing its ->d_lock
  will *not* suffice to prevent that kind of mess - the scenario with
  eviction by memory pressure won't be prevented by that; you might have
  grabbed ->d_lock only after the dentry_kill() had released it, and
  at that point ->d_parent still points to what used to be the parent,
  but there's nothing to prevent its eviction.

->d_compare()
-------------

For ->d_compare() we just need to make sure it won't crash when called
for dying dentry - an incorrect return value won't harm the caller in
such case.  False positives and false negatives alike - the callers take
care of that.  To be pedantic, make that "false positives do not cause
problems unless they have ->d_manage()", but ->d_manage() is present
only on autofs and there's no autofs ->d_compare() instances.\ [#f1]_

There is no indication that ->d_compare() is called in RCU mode;
the majority of callers are such, anyway, so we need to cope with that.
VFS guarantees that dentry won't be freed under us; the same goes for
the superblock pointed to by its ->d_sb.  Name points to memory object
that won't get freed under us and length does not exceed the size of
that object.  The contents of that object is *NOT* guaranteed to be
stable; d_move() might race with us, modifying the name.  However, in
that case we are free to return an arbitrary result - the callers will
take care of both false positives and false negatives in such case.
The name we are comparing dentry with (passed in qstr) is stable,
thankfully...

If we need to access any other data, it's up to the filesystem
to protect it.  In practice it means that destruction of fs-private part
of superblock (and possibly unicode tables hanging off it, etc.) might
need to be RCU-delayed.

*IF* you want the behaviour that varies depending upon the parent
directory, you get to be very careful with READ_ONCE() and watch out
for the object lifetimes.

Basically, if the things get that tricky, ask for help.
Currently there are two such instances in the tree - proc_sys_compare()
and generic_ci_d_compare().  Both are... special.

->d_hash()
----------

For ->d_hash() on a dying dentry we are free to report any hash
value; the only extra requirement is that we should not return stray
hard errors.  In other words, if we return anything other than 0 or
-ECHILD, we'd better make sure that this error would've been correct
before the parent started dying.  Since ->d_hash() error reporting is
usually done to reject unacceptable names (too long, contain unsuitable
characters for this filesystem, etc.), that's really not a problem -
hard errors depend only upon the name, not the parent.

Again, VFS guarantees that freeing of dentry and of the superblock
pointed to by dentry->d_sb won't happen under us.  The name passed to
us (in qstr) is stable.  If you need anything beyond that, you are
in the same situation as with ->d_compare().  Might want to RCU-delay
freeing private part of superblock (if that's what we need to access),
might want the same for some objects hanging off that (unicode tables,
etc.).  If you need something beyond that - ask for help.

->d_revalidate()
----------------

For this one we do have an indication of call being unsafe -
flags & LOOKUP_RCU.  With ->d_revalidate we are always allowed to bail
out and return -ECHILD; that will have the caller drop out of RCU mode.
We definitely need to do that if revalidate would require any kind of IO,
mutex-taking, etc.; we can't block in RCU mode.

Quite a few instances of ->d_revalidate() simply treat LOOKUP_RCU
in flags as "return -ECHILD and be done with that"; it's guaranteed to
do the right thing, but you lose the benefits of RCU pathwalks whenever
you run into such dentry.

Same as with the previous methods, we are guaranteed that
dentry and dentry->d_sb won't be freed under us.  We are also guaranteed
that ->d_parent (which is *not* stable, so use READ_ONCE) points to a
struct dentry that won't get freed under us.  As always with ->d_parent,
it's not NULL - for a detached dentry it will point to dentry itself.
d_inode_rcu() of dentry and its parent will be either NULL or will
point to a struct inode that won't get freed under us.  Anything beyond
than that is not guaranteed.  We may find parent to be negative - it can
happen if we race with d_move() and removal of old parent.  In that case
just return -ECHILD and be done with that.

On non-RCU side you could use dget_parent() instead - that
would give a positive dentry and its ->d_inode would remain stable.
dget_parent() has to be paired with dput(), though, so it's not usable
in RCU mode.

If you need fs-private objects associated with dentry, its parent
inode(s) or superblock - see the general notes above on how to access
those.

->d_manage()
------------

Can be called in RCU mode; gets an argument telling it if it has
been called so.  Pretty much autofs-only; for everyone's sanity sake,
don't inflict more of those on the kernel.  Definitely don't do that
without asking first...

->permission()
--------------

Can be called in RCU mode; that is indicated by MAY_NOT_BLOCK
in mask, and it can only happen for MAY_EXEC checks on directories.
In RCU mode it is not allowed to block, and it is allowed to bail out
by returning -ECHILD.  It might be called for an inode that is getting
torn down, possibly along with its filesystem.  Errors other than -ECHILD
should only be returned if they would've been returned in non-RCU mode;
several instances in procfs did (prior to 6.8-rc6) run afoul of that one.
That's an instructive example, BTW - what happens is that proc_pid_permission()
uses proc_get_task() to find the relevant process.  proc_get_task()
uses PID reference stored in struct proc_inode our inode is embedded
into; inode can't have been freed yet, so fetching ->pid member in that
is safe.  However, using the value you've fetched is a different story
- proc_evict_inode() would have passed it to put_pid() and replaced
it with NULL.  Unsafe caller has no way to tell if that is happening
right under it.  Solution: stop zeroing ->pid in proc_evict_inode()
and move put_pid() from proc_pid_evict_inode() to proc_free_inode().
That's not all that is needed (there's access to procfs-private part of
superblock as well), but it does make a good example of how such stuff
can be dealt with.

Note that idmap argument is safe on all calls - its destruction
is rcu-delayed.

The amount of headache is seriously reduced (for now) by the fact
that a lot of instances boil down to generic_permission() (which will
do the right thing in RCU mode) when mask is MAY_EXEC | MAY_NOT_BLOCK.
If we ever extend RCU mode to other ->permission() callers, the thing will
get interesting; that's not likely to happen, though, unless access(2)
goes there [this is NOT a suggestion, folks].

->get_link()
------------

Again, this can be called in RCU mode.  Even if your ->d_revalidate()
always returns -ECHILD in RCU mode and kicks the pathwalk out of it,
you can't assume that ->get_link() won't be reached.\ [#f2]_

NULL dentry argument is an indicator of unsafe call; if you can't handle
it, just return ERR_PTR(-ECHILD).  Any allocations you need to do (and
with this method you really might need that) should be done with GFP_ATOMIC
in the unsafe case.

Whatever you pass to set_delayed_call() is going to be called
in the same mode as ->get_link() itself; not a problem for most of the
instances.  The string you return needs to stay there until the
callback gets called or, if no callback is set, until at least the
freeing of inode.  As usual, for an unsafe call the inode might be
in process of teardown, possibly along with the hosting filesystem.
The usual considerations apply.  The same, BTW, applies to whatever
you set in ->i_link - it must stay around at least until ->free_inode().

->get_inode_acl()
-----------------

Very limited exposure for that one - unsafe call is possible
only if you explicitly set ACL_DONT_CACHE as cached ACL value.
Only two filesystems (fuse and overlayfs) even bother.  Unsafe call
is indicated by explicit flag (the third argument of the method),
bailout is done by returning ERR_PTR(-CHILD) and the usual considerations
apply for any access to data structures you might need to do.

.. rubric:: Footnotes                                                                                                              

.. [#f0]

  The fast path of pathname resolution really can run into a dentry on
  a filesystem that is getting shut down.

  Here's one of the scenarios for that to happen:

	1. have two threads sharing fs_struct chdir'ed on that filesystem.
	2. lazy-umount it, so that the only thing holding it alive is
	   cwd of these threads.
	3. the first thread does relative pathname resolution
	   and gets to e.g. ->d_hash().  It's holding rcu_read_lock().
	4. at the same time the second thread does fchdir(), moving to
	   different directory.

  In fchdir(2) we get to set_fs_pwd(), which set the current directory
  to the new place and does mntput() on the old one.  No RCU delays here,
  we calculate the refcount of that mount and see that we are dropping
  the last reference.  We make sure that the pathwalk in progress in
  the first thread will fail when it comes to legitimize_mnt() and do this
  (in mntput_no_expire())::

	init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
	if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
		return;

  As we leave the syscall, we have __cleanup_mnt() run; it calls
  cleanup_mnt() on our mount, which hits deactivate_super().  That was
  the last reference to superblock.

  Voila - we have a filesystem shutdown right under the nose of
  a thread running in ->d_hash() of something on that filesystem.
  Mutatis mutandis, one can arrange the same for other methods called
  by rcu pathwalk.

  It's not easy to hit (especially if you want to get through the
  entire ->kill_sb() before the first thread gets through ->d_hash()),
  and it's probably impossible on the real hardware; on KVM it might be
  borderline doable.  However, it is possible and I would not swear that
  other ways of arranging the same thing are equally hard to hit.

  The bottom line: methods that can be called in RCU mode need to be
  careful about the per-superblock objects destruction.

.. [#f1]

  Some callers prevent being called for dying dentry (holding ->d_lock
  and having verified !d_unhashed() or finding it in the list of inode's
  aliases under ->i_lock).  For those the scenario in question simply
  cannot arise.

  Some follow the match with lockref_get_not_dead() and treat the failure
  as mismatch.  That takes care of false positives, and false negatives
  on dying dentry are still correct - we simply pretend to have lost
  the race.

  The only caller that does not fit into the classes above is
  __d_lookup_rcu_op_compare().  There we sample ->d_seq and verify
  !d_unhashed() before calling ->d_compare().  That is not enough to
  prevent dentry from starting to die right under us; however, the
  sampled value of ->d_seq will be rechecked when the caller gets to
  step_into(), so for a false positive we will end up with a mismatch.
  The corner case around ->d_manage() is due to the handle_mounts()
  done before step_into() gets to ->d_seq validation...

.. [#f2]

  binding a symlink on top of a regular file on another filesystem is
  possible and that's all it takes for RCU pathwalk to get there.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-02-28 13:01 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-25 22:06 [RFC] documentation on filesystem exposure to RCU pathwalk from fs maintainers' POV Al Viro
2024-02-26  6:35 ` Vegard Nossum
2024-02-26  9:13   ` Al Viro
2024-02-26  7:58 ` Randy Dunlap
2024-02-26 18:35 ` [RFC][v2] " Al Viro
2024-02-27  8:04   ` [RFC][v3] " Al Viro
2024-02-28  4:20     ` Akira Yokosawa
2024-02-28 13:02     ` [RFC][v4] " Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).