Re: [PATCH] Docs: fs: prove directory locking more formally

Linux-Fsdevel Archive mirror
 help / color / mirror / Atom feed

From: Mo Zou <lostzoumo@gmail.com>
To: viro@zeniv.linux.org.uk, brauner@kernel.org,
	Jan Kara <jack@suse.cz>,  "corbet@lwn.net" <corbet@lwn.net>
Cc: linux-fsdevel@vger.kernel.org, dd_nirvana@sjtu.edu.cn,
	 mingkaidong@sjtu.edu.cn, haibochen@sjtu.edu.cn,
	linux-doc@vger.kernel.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Docs: fs: prove directory locking more formally
Date: Mon, 6 May 2024 20:29:01 +0800	[thread overview]
Message-ID: <CAHfrynM7+7RqhrGPqg=+uT8rCUH+8hvTRZg9vGz_xcSr-3Xtbw@mail.gmail.com> (raw)
In-Reply-To: <CAHfrynPTuNDH7J4=mVcya7nVMg-0ovkwewciTXbS-hSX7G1cAw@mail.gmail.com>

> > Instead of proof by contradiction, this proof more intuitively explains
> > the order that the locking rules want to enforce. If the developers want
> > to update the locking rules, they would be forced to come up with the
> > corresponding order that justifies deadlock-freedom, otherwise, the
> > proof won't pass. As a evidence of effectiveness, the authors of RefFS
> > paper firstly report the bug from commit 28eceeda130f ("fs: Lock moved
> > directories") through this formal approach.
>
> Any comments on this proof? BTW, it would be okay to keep both the
> by-contradiction proof and by-order proof as both proofs may find their
> audience.

Continuously seeking for comments.

Mo Zou <lostzoumo@gmail.com> 于2024年4月30日周二 09:44写道：
>
> > Instead of proof by contradiction, this proof more intuitively explains
> > the order that the locking rules want to enforce. If the developers want
> > to update the locking rules, they would be forced to come up with the
> > corresponding order that justifies deadlock-freedom, otherwise, the
> > proof won't pass. As a evidence of effectiveness, the authors of RefFS
> > paper firstly report the bug from commit 28eceeda130f ("fs: Lock moved
> > directories") through this formal approach.
>
> Any comments on this proof? BTW, it would be okay to keep both the
> by-contradiction proof and by-order proof as both proofs may find their
> audience.
>
> Mo Zou <lostzoumo@gmail.com> 于2024年4月13日周六 00:10写道：
> >
> > The directory locking proof should and could be made more formal.
> > Specifically, the locking rules for rename are very intricate and
> > subtle. The current proof shows deadlock-freedom by contradiction, i.e.,
> > if a deadlock is possible, where could go wrong. There is no denying
> > that the proof is detailed with all possible cases considered. However,
> > it gives less intuition on why the locking rules are correct and may be
> > hard to maintain. Developers and even experts may still make mistakes
> > without realizing that the proof is no longer correct. See commit
> > 28eceeda130f ("fs: Lock moved directories") for such a case.
> >
> > A recent academia paper RefFS from OSDI 2024 has proposed a formal way
> > to prove the correctness of locking scheme with machine-checkable proof
> > (see https://ipads.se.sjtu.edu.cn/projects/reffs for a preprint of the
> > paper). The core idea is to formally define the locking order of the
> > rules. Actually, the existing proof has the same idea but fails to
> > define the internal ranking for directories. The difficulty is that the
> > ranking for directories is not static but dynamic. In this patch, we
> > follow the idea from the paper to dynamically define the internal
> > ranking of directories and shows that operations always take directories
> > in order of increasing rank so that deadlocks are never possible.
> > Instead of proof by contradiction, this proof more intuitively explains
> > the order that the locking rules want to enforce. If the developers want
> > to update the locking rules, they would be forced to come up with the
> > corresponding order that justifies deadlock-freedom, otherwise, the
> > proof won't pass. As a evidence of effectiveness, the authors of RefFS
> > paper firstly report the bug from commit 28eceeda130f ("fs: Lock moved
> > directories") through this formal approach.
> >
> > Signed-off-by: Mo Zou <lostzoumo@gmail.com>
> > ---
> >  .../filesystems/directory-locking.rst         | 224 +++++++++---------
> >  1 file changed, 113 insertions(+), 111 deletions(-)
> >
> > diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
> > index 05ea387bc9fb..94e358d71986 100644
> > --- a/Documentation/filesystems/directory-locking.rst
> > +++ b/Documentation/filesystems/directory-locking.rst
> > @@ -44,7 +44,7 @@ For our purposes all operations fall in 6 classes:
> >         * decide which of the source and target need to be locked.
> >           The source needs to be locked if it's a non-directory, target - if it's
> >           a non-directory or about to be removed.
> > -       * take the locks that need to be taken (exlusive), in inode pointer order
> > +       * take the locks that need to be taken (exclusive), in inode pointer order
> >           if need to take both (that can happen only when both source and target
> >           are non-directories - the source because it wouldn't need to be locked
> >           otherwise and the target because mixing directory and non-directory is
> > @@ -135,12 +135,13 @@ If no directory is its own ancestor, the scheme above is deadlock-free.
> >  Proof:
> >
> >  There is a ranking on the locks, such that all primitives take
> > -them in order of non-decreasing rank.  Namely,
> > +them in order of increasing rank.  Namely,
> >
> >    * rank ->i_rwsem of non-directories on given filesystem in inode pointer
> >      order.
> > -  * put ->i_rwsem of all directories on a filesystem at the same rank,
> > -    lower than ->i_rwsem of any non-directory on the same filesystem.
> > +  * put ->i_rwsem of all directories on a filesystem lower than ->i_rwsem
> > +    of any non-directory on the same filesystem and the internal ranking
> > +    of directories are defined later.
> >    * put ->s_vfs_rename_mutex at rank lower than that of any ->i_rwsem
> >      on the same filesystem.
> >    * among the locks on different filesystems use the relative
> > @@ -149,92 +150,120 @@ them in order of non-decreasing rank.  Namely,
> >  For example, if we have NFS filesystem caching on a local one, we have
> >
> >    1. ->s_vfs_rename_mutex of NFS filesystem
> > -  2. ->i_rwsem of directories on that NFS filesystem, same rank for all
> > +  2. ->i_rwsem of directories on that NFS filesystem, internal ranking
> > +     defined later
> >    3. ->i_rwsem of non-directories on that filesystem, in order of
> >       increasing address of inode
> >    4. ->s_vfs_rename_mutex of local filesystem
> > -  5. ->i_rwsem of directories on the local filesystem, same rank for all
> > +  5. ->i_rwsem of directories on the local filesystem, internal ranking
> > +     defined later
> >    6. ->i_rwsem of non-directories on local filesystem, in order of
> >       increasing address of inode.
> >
> > -It's easy to verify that operations never take a lock with rank
> > -lower than that of an already held lock.
> > -
> > -Suppose deadlocks are possible.  Consider the minimal deadlocked
> > -set of threads.  It is a cycle of several threads, each blocked on a lock
> > -held by the next thread in the cycle.
> > -
> > -Since the locking order is consistent with the ranking, all
> > -contended locks in the minimal deadlock will be of the same rank,
> > -i.e. they all will be ->i_rwsem of directories on the same filesystem.
> > -Moreover, without loss of generality we can assume that all operations
> > -are done directly to that filesystem and none of them has actually
> > -reached the method call.
> > -
> > -In other words, we have a cycle of threads, T1,..., Tn,
> > -and the same number of directories (D1,...,Dn) such that
> > -
> > -       T1 is blocked on D1 which is held by T2
> > -
> > -       T2 is blocked on D2 which is held by T3
> > -
> > -       ...
> > -
> > -       Tn is blocked on Dn which is held by T1.
> > -
> > -Each operation in the minimal cycle must have locked at least
> > -one directory and blocked on attempt to lock another.  That leaves
> > -only 3 possible operations: directory removal (locks parent, then
> > -child), same-directory rename killing a subdirectory (ditto) and
> > -cross-directory rename of some sort.
> > -
> > -There must be a cross-directory rename in the set; indeed,
> > -if all operations had been of the "lock parent, then child" sort
> > -we would have Dn a parent of D1, which is a parent of D2, which is
> > -a parent of D3, ..., which is a parent of Dn.  Relationships couldn't
> > -have changed since the moment directory locks had been acquired,
> > -so they would all hold simultaneously at the deadlock time and
> > -we would have a loop.
> > -
> > -Since all operations are on the same filesystem, there can't be
> > -more than one cross-directory rename among them.  Without loss of
> > -generality we can assume that T1 is the one doing a cross-directory
> > -rename and everything else is of the "lock parent, then child" sort.
> > -
> > -In other words, we have a cross-directory rename that locked
> > -Dn and blocked on attempt to lock D1, which is a parent of D2, which is
> > -a parent of D3, ..., which is a parent of Dn.  Relationships between
> > -D1,...,Dn all hold simultaneously at the deadlock time.  Moreover,
> > -cross-directory rename does not get to locking any directories until it
> > -has acquired filesystem lock and verified that directories involved have
> > -a common ancestor, which guarantees that ancestry relationships between
> > -all of them had been stable.
> > -
> > -Consider the order in which directories are locked by the
> > -cross-directory rename; parents first, then possibly their children.
> > -Dn and D1 would have to be among those, with Dn locked before D1.
> > -Which pair could it be?
> > -
> > -It can't be the parents - indeed, since D1 is an ancestor of Dn,
> > -it would be the first parent to be locked.  Therefore at least one of the
> > -children must be involved and thus neither of them could be a descendent
> > -of another - otherwise the operation would not have progressed past
> > -locking the parents.
> > -
> > -It can't be a parent and its child; otherwise we would've had
> > -a loop, since the parents are locked before the children, so the parent
> > -would have to be a descendent of its child.
> > -
> > -It can't be a parent and a child of another parent either.
> > -Otherwise the child of the parent in question would've been a descendent
> > -of another child.
> > -
> > -That leaves only one possibility - namely, both Dn and D1 are
> > -among the children, in some order.  But that is also impossible, since
> > -neither of the children is a descendent of another.
> > -
> > -That concludes the proof, since the set of operations with the
> > -properties requiered for a minimal deadlock can not exist.
> > +.. _RefFS: https://ipads.se.sjtu.edu.cn/projects/reffs
> > +
> > +The ranking above is mostly static except the internal ranking of
> > +directories. Below we will focus on internal ranking of directories on
> > +the same filesystem and present a way to dynamically define ranks of
> > +directories such that operations still take locks in order of
> > +increasing rank, i.e., operations always take a lock with rank higher
> > +than that of any already held locks. Note that the idea of dynamic
> > +ranking has a formal basis and more details can be found in the
> > +academia paper RefFS_ from OSDI 2024.
> > +
> > +
> > +First, without considering cross-directory renames, a directory's rank
> > +is defined as its distance from the root, i.e., root directory has rank
> > +0 and each child directory's rank equals one plus its parent
> > +directory's rank. Because no directory is its own ancestor, every
> > +directory's rank can be uniquely determined starting from root to its
> > +descendants. This ranking accounts for the (locks parent, then child)
> > +nested locking in directory removal and same-directory rename.
> > +
> > +Then, consider the most complex case in a cross-directory rename, where the
> > +most locks need to be acquired. The rename may perform the following
> > +nested locking (only show lock statements for convenience). Here, dir1
> > +and dir2 are the two parents and according to the locking rules, we
> > +know dir2 is not ancestor to dir1. child1 and child2 are source and
> > +target in a order according to the rules above and may not necessarily
> > +correspond to the child of dir1 and dir2. Ignore things in angle
> > +brackets for now. The locking order here can not be predetermined. For
> > +instance, rename(/a/x,/b/y) and rename(/b/y,/a/x) would acquire /a and
> > +/b in different order because neither parent is an ancestor of the
> > +other and we lock the parent of source first. But the observation is
> > +that after acquiring the filesystem lock, the precise locking order
> > +becomes known. We introduce a ghost state edge to help define the
> > +ranks (ghost state is a commonly used technique in verification which
> > +has no influence on the program execution and its only role is to
> > +assist the proof).
> > +
> > +  1. lock filesystem (->s_vfs_rename_mutex);
> > +  2. lock dir1->i_rwsem; <set edge to (dir1, dir2)>
> > +  3. lock dir2->i_rwsem; <set edge to (non-parent, child1) if child1 is
> > +     a directory and non-parent is the one in dir1 and dir2 that is not
> > +     parent to child1, otherwise, set edge to None>
> > +  4. lock child1->i_rwsem; <set edge to (child1, child2) if chil1 and
> > +     child2 are directories, otherwise, set edge to None>
> > +  5. lock child2->i_rwsem; <set edge to None>
> > +
> > +The value of edge is either None or (inode1,inode2)
> > +representing a possible lock dependency from inode1 to inode2, i.e.,
> > +some thread may request lock of inode2 with lock of inode1 held. There
> > +is only ever one edge value in a filesystem with edge set to None when
> > +the s_vfs_rename_mutex is not held and edge set to a determined value
> > +when s_vfs_rename_mutex is held. When edge is None, the ranking rules
> > +are the same as before (root is 0 and child is one plus its parent).
> > +When edge is (inode1, inode2), only the ranking rule for inode2
> > +changes: inode2's rank is one plus the larger rank of inode2's parent
> > +and inode1. Intuitively, a directory's rank is the longest distance
> > +from root in the filesystem tree extended with edge.
> > +
> > +Now let's check the setting of edge in angle brackets in above code to
> > +see how it ensures cross-directory rename takes locks in increasing
> > +order in this case. After the lock statement at line 2, edge is set to
> > +(dir1,dir2). Because dir2 is not ancestor to dir1, the rank of dir1
> > +stays the same and the rank of dir2 (as well as every other
> > +directories) can be uniquely determined. So at line 3, dir2's rank is
> > +higher than dir1's. Before the code acquires child1 at line 4, it
> > +already holds dir1 and dir2. We know one of dir1 and dir2 is parent to
> > +child1, so we update edge to (the non-parent of child1, child1).
> > +Because child1 cannot be ancestor to dir1 or dir2, every directory's
> > +rank is still uniquely defined (for the same reason as above). So at
> > +line 4, we ensure child1 has higher rank than dir1 and dir2. Similarly,
> > +we update edge to (child1, child2) so that at line 5, child2 has higher
> > +rank than held locks (dir1, dir2 and child1). We reset edge to None
> > +whenever we have acquired all directory locks.
> > +
> > +In a less complex case of cross-directory rename, the code may execute
> > +a prefix of above lines and we always reset edge to None after having
> > +acquired all directory locks.
> > +
> > +The ranking of directories may change under following cases:
> > +
> > +  * directory removal/creation: a directory's rank is removed/added
> > +  * rename: the effect of rename is either (1) moving a subtree under a
> > +    node that is outside of the subtree (the directories in the subtree are
> > +    all recalculated based on the node whose rank is not influenced)
> > +    and potentially removing target (a directory's rank is removed) or
> > +    (2) in case of RENAME_EXCHANGE, two subtrees exchange their
> > +    positions. Because source is not a descendent of the target and
> > +    target is not a descendent of source, the parents of the two
> > +    subtrees cannot not be in these subtrees so the ranks of parents
> > +    are not influenced. The two subtrees' ranks are recalculated based
> > +    on their parents
> > +  * edge updates in a cross-directory rename: setting edge to
> > +    (inode1,inode2) or resetting edge to None modify the ranks of the
> > +    inode2 subtree
> > +
> > +Initially in a filesystem, no directory is its own ancestor and
> > +directories' ranks can be uniquely determined. All these changes
> > +preserve the fact that at every time, the ranking of every directory
> > +can be uniquely determined. Thus no directory is its own ancestor
> > +(because if a directory is its own ancestor, its rank cannot be
> > +determined) and we prove the premise. Plus the fact that operations
> > +always take locks in order of increasing rank (can be easily verified).
> > +We can conclude the locking rules above are deadlock-free.
> > +
> >
> >  Note that the check for having a common ancestor in cross-directory
> >  rename is crucial - without it a deadlock would be possible.  Indeed,
> > @@ -247,33 +276,6 @@ distant ancestor.  Add a bunch of rmdir() attempts on all directories
> >  in between (all of those would fail with -ENOTEMPTY, had they ever gotten
> >  the locks) and voila - we have a deadlock.
> >
> > -Loop avoidance
> > -==============
> > -
> > -These operations are guaranteed to avoid loop creation.  Indeed,
> > -the only operation that could introduce loops is cross-directory rename.
> > -Suppose after the operation there is a loop; since there hadn't been such
> > -loops before the operation, at least on of the nodes in that loop must've
> > -had its parent changed.  In other words, the loop must be passing through
> > -the source or, in case of exchange, possibly the target.
> > -
> > -Since the operation has succeeded, neither source nor target could have
> > -been ancestors of each other.  Therefore the chain of ancestors starting
> > -in the parent of source could not have passed through the target and
> > -vice versa.  On the other hand, the chain of ancestors of any node could
> > -not have passed through the node itself, or we would've had a loop before
> > -the operation.  But everything other than source and target has kept
> > -the parent after the operation, so the operation does not change the
> > -chains of ancestors of (ex-)parents of source and target.  In particular,
> > -those chains must end after a finite number of steps.
> > -
> > -Now consider the loop created by the operation.  It passes through either
> > -source or target; the next node in the loop would be the ex-parent of
> > -target or source resp.  After that the loop would follow the chain of
> > -ancestors of that parent.  But as we have just shown, that chain must
> > -end after a finite number of steps, which means that it can't be a part
> > -of any loop.  Q.E.D.
> > -
> >  While this locking scheme works for arbitrary DAGs, it relies on
> >  ability to check that directory is a descendent of another object.  Current
> >  implementation assumes that directory graph is a tree.  This assumption is
> > --
> > 2.30.1 (Apple Git-130)
> >

     prev parent reply	other threads:[~2024-05-06 12:29 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-12 16:10 [PATCH] Docs: fs: prove directory locking more formally Mo Zou
2024-04-30  1:44 ` Mo Zou
2024-05-06 12:29   ` Mo Zou [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHfrynM7+7RqhrGPqg=+uT8rCUH+8hvTRZg9vGz_xcSr-3Xtbw@mail.gmail.com' \
    --to=lostzoumo@gmail.com \
    --cc=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=dd_nirvana@sjtu.edu.cn \
    --cc=haibochen@sjtu.edu.cn \
    --cc=jack@suse.cz \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingkaidong@sjtu.edu.cn \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).