All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
@ 2017-08-07 12:54 Alex Lyakas
  2017-08-07 15:16 ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Lyakas @ 2017-08-07 12:54 UTC (permalink / raw
  To: linux-xfs; +Cc: david, bfoster, Shyam Kaushik, Yair Hershko

The new attribute leaf buffer is not held locked across the transaction roll
between the shortform->leaf modification and the addition of the new entry.
As a result, the attribute buffer modification being made is not atomic
from an operational perspective.
Hence the AIL push can grab it in the transient state of "just created"
after the initial transaction is rolled, because the buffer has been 
released.
This leads to xfs_attr3_leaf_verify() asserting that hdr.count is zero,
treating this as in-memory corruption, and shutting down the filesystem.

More info at:
https://www.spinics.net/lists/linux-xfs/msg08778.html

This patch is based on kernel 3.18.19, which is a long-term (although EOL) 
kernel.
This is the reason it is marked as RFC.

Reproducing the issue is achieved by adding "msleep(1000)" after the 
xfs_trans_roll() call.
>From the limited testing, this patch seems to fix the issue.
Once/if the community approves this patch, we will do a formal testing.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
---
fs/xfs/libxfs/xfs_attr.c      | 24 +++++++++++++++++++++---
fs/xfs/libxfs/xfs_attr_leaf.c |  5 ++++-
fs/xfs/libxfs/xfs_attr_leaf.h |  2 +-
3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 353fb42..c0ced12 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -285,20 +285,21 @@ xfs_attr_set(

     xfs_trans_ijoin(args.trans, dp, 0);

     /*
      * If the attribute list is non-existent or a shortform list,
      * upgrade it to a single-leaf-block attribute list.
      */
     if (dp->i_d.di_aformat == XFS_DINODE_FMT_LOCAL ||
         (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS &&
          dp->i_d.di_anextents == 0)) {
+        xfs_buf_t *leaf_bp = NULL;

         /*
          * Build initial attribute list (if required).
          */
         if (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS)
             xfs_attr_shortform_create(&args);

         /*
          * Try to add the attr to the attribute list in
          * the inode.
@@ -328,48 +329,65 @@ xfs_attr_set(
             xfs_iunlock(dp, XFS_ILOCK_EXCL);

             return error ? error : err2;
         }

         /*
          * It won't fit in the shortform, transform to a leaf block.
          * GROT: another possible req'mt for a double-split btree op.
          */
         xfs_bmap_init(args.flist, args.firstblock);
-        error = xfs_attr_shortform_to_leaf(&args);
+        error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);
+        /* hold the leaf buffer locked, when "args.trans" transaction 
commits */
+        if (leaf_bp)
+            xfs_trans_bhold(args.trans, leaf_bp);
+
         if (!error) {
             error = xfs_bmap_finish(&args.trans, args.flist,
                         &committed);
         }
         if (error) {
             ASSERT(committed);
+            if (leaf_bp)
+                xfs_buf_relse(leaf_bp);
             args.trans = NULL;
             xfs_bmap_cancel(&flist);
             goto out;
         }

         /*
          * bmap_finish() may have committed the last trans and started
          * a new one.  We need the inode to be in all transactions.
+         * We also need to rejoin the leaf buffer to the new transaction
+         * and prevent it from being unlocked, before we commit it in 
xfs_trans_roll().
+         * If bmap_finish() did not commit, then we are in the same 
transaction,
+         * and no need to call xfs_trans_bhold() again.
          */
-        if (committed)
+        if (committed) {
             xfs_trans_ijoin(args.trans, dp, 0);
+            xfs_trans_bjoin(args.trans, leaf_bp);
+            xfs_trans_bhold(args.trans, leaf_bp);
+        }

         /*
          * Commit the leaf transformation.  We'll need another (linked)
          * transaction to add the new attribute to the leaf.
          */

         error = xfs_trans_roll(&args.trans, dp);
-        if (error)
+        if (error) {
+            xfs_buf_relse(leaf_bp);
             goto out;
+        }

+        /* rejoin the leaf buffer to the new transaction */
+        xfs_trans_bjoin(args.trans, leaf_bp);
     }

     if (xfs_bmap_one_block(dp, XFS_ATTR_FORK))
         error = xfs_attr_leaf_addname(&args);
     else
         error = xfs_attr_node_addname(&args);
     if (error)
         goto out;

     /*
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b7cd0a0..2e03c32 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -649,23 +649,24 @@ xfs_attr_shortform_getvalue(xfs_da_args_t *args)
         args->valuelen = sfe->valuelen;
         memcpy(args->value, &sfe->nameval[args->namelen],
                             args->valuelen);
         return -EEXIST;
     }
     return -ENOATTR;
}

/*
  * Convert from using the shortform to the leaf.
+ * Upon success, return the leaf buffer.
  */
int
-xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
+xfs_attr_shortform_to_leaf(xfs_da_args_t *args, xfs_buf_t **bpp)
{
     xfs_inode_t *dp;
     xfs_attr_shortform_t *sf;
     xfs_attr_sf_entry_t *sfe;
     xfs_da_args_t nargs;
     char *tmpbuffer;
     int error, i, size;
     xfs_dablk_t blkno;
     struct xfs_buf *bp;
     xfs_ifork_t *ifp;
@@ -733,20 +734,22 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
         ASSERT(error == -ENOATTR);
         error = xfs_attr3_leaf_add(bp, &nargs);
         ASSERT(error != -ENOSPC);
         if (error)
             goto out;
         sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
     }
     error = 0;

out:
+    if (error == 0)
+        *bpp = bp;
     kmem_free(tmpbuffer);
     return error;
}

/*
  * Check a leaf attribute block to see if all the entries would fit into
  * a shortform attribute list.
  */
int
xfs_attr_shortform_allfit(
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 4f3a60a..d58e9e4 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -41,21 +41,21 @@ typedef struct xfs_attr_inactive_list {
  * Function prototypes for the kernel.
  *========================================================================*/

/*
  * Internal routines when attribute fork size < XFS_LITINO(mp).
  */
void    xfs_attr_shortform_create(struct xfs_da_args *args);
void    xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
int    xfs_attr_shortform_lookup(struct xfs_da_args *args);
int    xfs_attr_shortform_getvalue(struct xfs_da_args *args);
-int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
+int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args, xfs_buf_t 
**bpp);
int    xfs_attr_shortform_remove(struct xfs_da_args *args);
int    xfs_attr_shortform_list(struct xfs_attr_list_context *context);
int    xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
int    xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes);
void    xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);

/*
  * Internal routines when attribute fork size == XFS_LBSIZE(mp).
  */
int    xfs_attr3_leaf_to_node(struct xfs_da_args *args);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
  2017-08-07 12:54 [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute Alex Lyakas
@ 2017-08-07 15:16 ` Brian Foster
  2017-08-07 16:37   ` Darrick J. Wong
  2017-08-07 17:00   ` Alex Lyakas
  0 siblings, 2 replies; 9+ messages in thread
From: Brian Foster @ 2017-08-07 15:16 UTC (permalink / raw
  To: Alex Lyakas; +Cc: linux-xfs, david, Shyam Kaushik, Yair Hershko

On Mon, Aug 07, 2017 at 03:54:03PM +0300, Alex Lyakas wrote:
> The new attribute leaf buffer is not held locked across the transaction roll
> between the shortform->leaf modification and the addition of the new entry.
> As a result, the attribute buffer modification being made is not atomic
> from an operational perspective.
> Hence the AIL push can grab it in the transient state of "just created"
> after the initial transaction is rolled, because the buffer has been
> released.
> This leads to xfs_attr3_leaf_verify() asserting that hdr.count is zero,
> treating this as in-memory corruption, and shutting down the filesystem.
> 
> More info at:
> https://www.spinics.net/lists/linux-xfs/msg08778.html
> 

FYI.. I'm not sure how appropriate it is to use URLs in commit log
descriptions. Perhaps somebody else can chime in on that.

> This patch is based on kernel 3.18.19, which is a long-term (although EOL)
> kernel.
> This is the reason it is marked as RFC.
> 

It's probably best to develop against the latest kernel, get the fix
right, then worry about v3.18 for a backport.

> Reproducing the issue is achieved by adding "msleep(1000)" after the
> xfs_trans_roll() call.
> From the limited testing, this patch seems to fix the issue.
> Once/if the community approves this patch, we will do a formal testing.
> 
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> ---

Note that your patch doesn't apply for me:

...
patching file fs/xfs/libxfs/xfs_attr.c
Hunk #1 FAILED at 285.
patch: **** malformed patch at line 70: commits */

Perhaps it has been damaged by your mailer? Otherwise, a few initial
comments...

> fs/xfs/libxfs/xfs_attr.c      | 24 +++++++++++++++++++++---
> fs/xfs/libxfs/xfs_attr_leaf.c |  5 ++++-
> fs/xfs/libxfs/xfs_attr_leaf.h |  2 +-
> 3 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index 353fb42..c0ced12 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -285,20 +285,21 @@ xfs_attr_set(
> 
>     xfs_trans_ijoin(args.trans, dp, 0);
> 
>     /*
>      * If the attribute list is non-existent or a shortform list,
>      * upgrade it to a single-leaf-block attribute list.
>      */
>     if (dp->i_d.di_aformat == XFS_DINODE_FMT_LOCAL ||
>         (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS &&
>          dp->i_d.di_anextents == 0)) {
> +        xfs_buf_t *leaf_bp = NULL;

Use struct xfs_buf here and throughout. We're attempting to remove uses
of the old typedefs wherever possible.

> 
>         /*
>          * Build initial attribute list (if required).
>          */
>         if (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS)
>             xfs_attr_shortform_create(&args);
> 
>         /*
>          * Try to add the attr to the attribute list in
>          * the inode.
> @@ -328,48 +329,65 @@ xfs_attr_set(
>             xfs_iunlock(dp, XFS_ILOCK_EXCL);
> 
>             return error ? error : err2;
>         }
> 
>         /*
>          * It won't fit in the shortform, transform to a leaf block.
>          * GROT: another possible req'mt for a double-split btree op.
>          */
>         xfs_bmap_init(args.flist, args.firstblock);
> -        error = xfs_attr_shortform_to_leaf(&args);
> +        error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);
> +        /* hold the leaf buffer locked, when "args.trans" transaction
> commits */
> +        if (leaf_bp)
> +            xfs_trans_bhold(args.trans, leaf_bp);

Indentation looks off here and throughout.

> +
>         if (!error) {
>             error = xfs_bmap_finish(&args.trans, args.flist,
>                         &committed);
>         }
>         if (error) {
>             ASSERT(committed);
> +            if (leaf_bp)
> +                xfs_buf_relse(leaf_bp);

Hmm, I don't think this is right. We don't want to release the buffer
while the transaction still has a reference. Perhaps we should move this
to the "out:" patch after the transaction cancel (and make sure the
pointer is reset at the appropriate place)..?

>             args.trans = NULL;
>             xfs_bmap_cancel(&flist);
>             goto out;
>         }
> 
>         /*
>          * bmap_finish() may have committed the last trans and started
>          * a new one.  We need the inode to be in all transactions.
> +         * We also need to rejoin the leaf buffer to the new transaction
> +         * and prevent it from being unlocked, before we commit it in
> xfs_trans_roll().
> +         * If bmap_finish() did not commit, then we are in the same
> transaction,
> +         * and no need to call xfs_trans_bhold() again.
>          */
> -        if (committed)
> +        if (committed) {
>             xfs_trans_ijoin(args.trans, dp, 0);
> +            xfs_trans_bjoin(args.trans, leaf_bp);
> +            xfs_trans_bhold(args.trans, leaf_bp);

I don't see why this is necessary. We still have the buffer held and
locked, yes?

> +        }
> 
>         /*
>          * Commit the leaf transformation.  We'll need another (linked)
>          * transaction to add the new attribute to the leaf.
>          */
> 
>         error = xfs_trans_roll(&args.trans, dp);
> -        if (error)
> +        if (error) {
> +            xfs_buf_relse(leaf_bp);

Same problem as above.

>             goto out;
> +        }
> 
> +        /* rejoin the leaf buffer to the new transaction */
> +        xfs_trans_bjoin(args.trans, leaf_bp);

This comment should point out that this allows a subsequent read to find
the buffer in the transaction (and avoid a deadlock).

>     }
> 
>     if (xfs_bmap_one_block(dp, XFS_ATTR_FORK))
>         error = xfs_attr_leaf_addname(&args);
>     else
>         error = xfs_attr_node_addname(&args);
>     if (error)
>         goto out;
> 
>     /*
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index b7cd0a0..2e03c32 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -649,23 +649,24 @@ xfs_attr_shortform_getvalue(xfs_da_args_t *args)
>         args->valuelen = sfe->valuelen;
>         memcpy(args->value, &sfe->nameval[args->namelen],
>                             args->valuelen);
>         return -EEXIST;
>     }
>     return -ENOATTR;
> }
> 
> /*
>  * Convert from using the shortform to the leaf.
> + * Upon success, return the leaf buffer.
>  */
> int
> -xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
> +xfs_attr_shortform_to_leaf(xfs_da_args_t *args, xfs_buf_t **bpp)
> {
>     xfs_inode_t *dp;
>     xfs_attr_shortform_t *sf;
>     xfs_attr_sf_entry_t *sfe;
>     xfs_da_args_t nargs;
>     char *tmpbuffer;
>     int error, i, size;
>     xfs_dablk_t blkno;
>     struct xfs_buf *bp;
>     xfs_ifork_t *ifp;
> @@ -733,20 +734,22 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
>         ASSERT(error == -ENOATTR);
>         error = xfs_attr3_leaf_add(bp, &nargs);
>         ASSERT(error != -ENOSPC);
>         if (error)
>             goto out;
>         sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
>     }
>     error = 0;
> 
> out:
> +    if (error == 0)
> +        *bpp = bp;

It looks like the only way error == 0 is where it is set just above, so
we could probably move this before the label.

I'm also wondering whether it might be cleaner to 1.) conditionally
return the buffer when sf->hdr.count == 0 because that is the only case
where the problem occurs and 2.) do the trans_bhold() here in
shortform_to_leaf() when we do actually return it.

Brian

>     kmem_free(tmpbuffer);
>     return error;
> }
> 
> /*
>  * Check a leaf attribute block to see if all the entries would fit into
>  * a shortform attribute list.
>  */
> int
> xfs_attr_shortform_allfit(
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> index 4f3a60a..d58e9e4 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> @@ -41,21 +41,21 @@ typedef struct xfs_attr_inactive_list {
>  * Function prototypes for the kernel.
>  *========================================================================*/
> 
> /*
>  * Internal routines when attribute fork size < XFS_LITINO(mp).
>  */
> void    xfs_attr_shortform_create(struct xfs_da_args *args);
> void    xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
> int    xfs_attr_shortform_lookup(struct xfs_da_args *args);
> int    xfs_attr_shortform_getvalue(struct xfs_da_args *args);
> -int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
> +int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args, xfs_buf_t
> **bpp);
> int    xfs_attr_shortform_remove(struct xfs_da_args *args);
> int    xfs_attr_shortform_list(struct xfs_attr_list_context *context);
> int    xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
> int    xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes);
> void    xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
> 
> /*
>  * Internal routines when attribute fork size == XFS_LBSIZE(mp).
>  */
> int    xfs_attr3_leaf_to_node(struct xfs_da_args *args);
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
  2017-08-07 15:16 ` Brian Foster
@ 2017-08-07 16:37   ` Darrick J. Wong
  2017-08-07 17:11     ` Alex Lyakas
  2017-08-07 17:26     ` Brian Foster
  2017-08-07 17:00   ` Alex Lyakas
  1 sibling, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2017-08-07 16:37 UTC (permalink / raw
  To: Brian Foster; +Cc: Alex Lyakas, linux-xfs, david, Shyam Kaushik, Yair Hershko

On Mon, Aug 07, 2017 at 11:16:54AM -0400, Brian Foster wrote:
> On Mon, Aug 07, 2017 at 03:54:03PM +0300, Alex Lyakas wrote:
> > The new attribute leaf buffer is not held locked across the transaction roll
> > between the shortform->leaf modification and the addition of the new entry.
> > As a result, the attribute buffer modification being made is not atomic
> > from an operational perspective.
> > Hence the AIL push can grab it in the transient state of "just created"
> > after the initial transaction is rolled, because the buffer has been
> > released.
> > This leads to xfs_attr3_leaf_verify() asserting that hdr.count is zero,
> > treating this as in-memory corruption, and shutting down the filesystem.
> > 
> > More info at:
> > https://www.spinics.net/lists/linux-xfs/msg08778.html
> > 
> 
> FYI.. I'm not sure how appropriate it is to use URLs in commit log
> descriptions. Perhaps somebody else can chime in on that.

I prefer the changelog directly quote the relevant parts of the message
in case spinics should end up getting knocked offline like gmane, though
you can certainly use the url to cite the quotation.

> > This patch is based on kernel 3.18.19, which is a long-term (although EOL)
> > kernel.
> > This is the reason it is marked as RFC.
> > 
> 
> It's probably best to develop against the latest kernel, get the fix
> right, then worry about v3.18 for a backport.

Yes, we've renamed a few things since then...

> > Reproducing the issue is achieved by adding "msleep(1000)" after the
> > xfs_trans_roll() call.
> > From the limited testing, this patch seems to fix the issue.
> > Once/if the community approves this patch, we will do a formal testing.

Do you have a reproducer script that could be turned into an xfstest to
prevent future regressions?  We could add an error injection site at
that point that would force an ail push to simulate the problem.

> > Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> > ---
> 
> Note that your patch doesn't apply for me:
> 
> ...
> patching file fs/xfs/libxfs/xfs_attr.c
> Hunk #1 FAILED at 285.
> patch: **** malformed patch at line 70: commits */
> 
> Perhaps it has been damaged by your mailer? Otherwise, a few initial
> comments...

Yes, I also see severe whitespace damage.

> > fs/xfs/libxfs/xfs_attr.c      | 24 +++++++++++++++++++++---
> > fs/xfs/libxfs/xfs_attr_leaf.c |  5 ++++-
> > fs/xfs/libxfs/xfs_attr_leaf.h |  2 +-
> > 3 files changed, 26 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> > index 353fb42..c0ced12 100644
> > --- a/fs/xfs/libxfs/xfs_attr.c
> > +++ b/fs/xfs/libxfs/xfs_attr.c
> > @@ -285,20 +285,21 @@ xfs_attr_set(
> > 
> >     xfs_trans_ijoin(args.trans, dp, 0);
> > 
> >     /*
> >      * If the attribute list is non-existent or a shortform list,
> >      * upgrade it to a single-leaf-block attribute list.
> >      */
> >     if (dp->i_d.di_aformat == XFS_DINODE_FMT_LOCAL ||
> >         (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS &&
> >          dp->i_d.di_anextents == 0)) {
> > +        xfs_buf_t *leaf_bp = NULL;
> 
> Use struct xfs_buf here and throughout. We're attempting to remove uses
> of the old typedefs wherever possible.
> 
> > 
> >         /*
> >          * Build initial attribute list (if required).
> >          */
> >         if (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS)
> >             xfs_attr_shortform_create(&args);
> > 
> >         /*
> >          * Try to add the attr to the attribute list in
> >          * the inode.
> > @@ -328,48 +329,65 @@ xfs_attr_set(
> >             xfs_iunlock(dp, XFS_ILOCK_EXCL);
> > 
> >             return error ? error : err2;
> >         }
> > 
> >         /*
> >          * It won't fit in the shortform, transform to a leaf block.
> >          * GROT: another possible req'mt for a double-split btree op.
> >          */
> >         xfs_bmap_init(args.flist, args.firstblock);
> > -        error = xfs_attr_shortform_to_leaf(&args);
> > +        error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);
> > +        /* hold the leaf buffer locked, when "args.trans" transaction
> > commits */
> > +        if (leaf_bp)
> > +            xfs_trans_bhold(args.trans, leaf_bp);
> 
> Indentation looks off here and throughout.
> 
> > +
> >         if (!error) {
> >             error = xfs_bmap_finish(&args.trans, args.flist,
> >                         &committed);

(FYI this is now xfs_defer_finish...)

> >         }
> >         if (error) {
> >             ASSERT(committed);
> > +            if (leaf_bp)
> > +                xfs_buf_relse(leaf_bp);
> 
> Hmm, I don't think this is right. We don't want to release the buffer
> while the transaction still has a reference. Perhaps we should move this
> to the "out:" patch after the transaction cancel (and make sure the
> pointer is reset at the appropriate place)..?

I would've thought that the buffer would be relse'd by xfs_trans_cancel,
having been bjoined to the first args.trans and bheld to any subsequent
transactions?

> 
> >             args.trans = NULL;
> >             xfs_bmap_cancel(&flist);
> >             goto out;
> >         }
> > 
> >         /*
> >          * bmap_finish() may have committed the last trans and started
> >          * a new one.  We need the inode to be in all transactions.
> > +         * We also need to rejoin the leaf buffer to the new transaction
> > +         * and prevent it from being unlocked, before we commit it in
> > xfs_trans_roll().
> > +         * If bmap_finish() did not commit, then we are in the same
> > transaction,
> > +         * and no need to call xfs_trans_bhold() again.
> >          */
> > -        if (committed)
> > +        if (committed) {
> >             xfs_trans_ijoin(args.trans, dp, 0);
> > +            xfs_trans_bjoin(args.trans, leaf_bp);
> > +            xfs_trans_bhold(args.trans, leaf_bp);
> 
> I don't see why this is necessary. We still have the buffer held and
> locked, yes?

The bhold means that after the transaction roll leaf_bp is still locked
but not associated with a transaction, so it is necessary to bjoin
leaf_bp to the new transaction.  Since the very next thing we do is roll
the transaction again, we also want to bhold it for the second roll.

(Question: is the second roll even necessary for this case?)

> > +        }
> > 
> >         /*
> >          * Commit the leaf transformation.  We'll need another (linked)
> >          * transaction to add the new attribute to the leaf.
> >          */
> > 
> >         error = xfs_trans_roll(&args.trans, dp);
> > -        if (error)
> > +        if (error) {
> > +            xfs_buf_relse(leaf_bp);
> 
> Same problem as above.
> 
> >             goto out;
> > +        }
> > 
> > +        /* rejoin the leaf buffer to the new transaction */
> > +        xfs_trans_bjoin(args.trans, leaf_bp);
> 
> This comment should point out that this allows a subsequent read to find
> the buffer in the transaction (and avoid a deadlock).
> 
> >     }
> > 
> >     if (xfs_bmap_one_block(dp, XFS_ATTR_FORK))
> >         error = xfs_attr_leaf_addname(&args);
> >     else
> >         error = xfs_attr_node_addname(&args);
> >     if (error)
> >         goto out;
> > 
> >     /*
> > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> > index b7cd0a0..2e03c32 100644
> > --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> > +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> > @@ -649,23 +649,24 @@ xfs_attr_shortform_getvalue(xfs_da_args_t *args)
> >         args->valuelen = sfe->valuelen;
> >         memcpy(args->value, &sfe->nameval[args->namelen],
> >                             args->valuelen);
> >         return -EEXIST;
> >     }
> >     return -ENOATTR;
> > }
> > 
> > /*
> >  * Convert from using the shortform to the leaf.
> > + * Upon success, return the leaf buffer.
> >  */
> > int
> > -xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
> > +xfs_attr_shortform_to_leaf(xfs_da_args_t *args, xfs_buf_t **bpp)
> > {
> >     xfs_inode_t *dp;
> >     xfs_attr_shortform_t *sf;
> >     xfs_attr_sf_entry_t *sfe;
> >     xfs_da_args_t nargs;
> >     char *tmpbuffer;
> >     int error, i, size;
> >     xfs_dablk_t blkno;
> >     struct xfs_buf *bp;
> >     xfs_ifork_t *ifp;
> > @@ -733,20 +734,22 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
> >         ASSERT(error == -ENOATTR);
> >         error = xfs_attr3_leaf_add(bp, &nargs);
> >         ASSERT(error != -ENOSPC);
> >         if (error)
> >             goto out;
> >         sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
> >     }
> >     error = 0;
> > 
> > out:
> > +    if (error == 0)
> > +        *bpp = bp;
> 
> It looks like the only way error == 0 is where it is set just above, so
> we could probably move this before the label.
> 
> I'm also wondering whether it might be cleaner to 1.) conditionally
> return the buffer when sf->hdr.count == 0 because that is the only case
> where the problem occurs and 2.) do the trans_bhold() here in
> shortform_to_leaf() when we do actually return it.

I /think/ you could get rid of that second roll and do the bhold in
shortform_to_leaf, though you'd still have to re-bjoin it after
defer_finish.

--D

> 
> Brian
> 
> >     kmem_free(tmpbuffer);
> >     return error;
> > }
> > 
> > /*
> >  * Check a leaf attribute block to see if all the entries would fit into
> >  * a shortform attribute list.
> >  */
> > int
> > xfs_attr_shortform_allfit(
> > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> > index 4f3a60a..d58e9e4 100644
> > --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> > +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> > @@ -41,21 +41,21 @@ typedef struct xfs_attr_inactive_list {
> >  * Function prototypes for the kernel.
> >  *========================================================================*/
> > 
> > /*
> >  * Internal routines when attribute fork size < XFS_LITINO(mp).
> >  */
> > void    xfs_attr_shortform_create(struct xfs_da_args *args);
> > void    xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
> > int    xfs_attr_shortform_lookup(struct xfs_da_args *args);
> > int    xfs_attr_shortform_getvalue(struct xfs_da_args *args);
> > -int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
> > +int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args, xfs_buf_t
> > **bpp);
> > int    xfs_attr_shortform_remove(struct xfs_da_args *args);
> > int    xfs_attr_shortform_list(struct xfs_attr_list_context *context);
> > int    xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
> > int    xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes);
> > void    xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
> > 
> > /*
> >  * Internal routines when attribute fork size == XFS_LBSIZE(mp).
> >  */
> > int    xfs_attr3_leaf_to_node(struct xfs_da_args *args);
> > -- 
> > 1.9.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
  2017-08-07 15:16 ` Brian Foster
  2017-08-07 16:37   ` Darrick J. Wong
@ 2017-08-07 17:00   ` Alex Lyakas
  2017-08-07 17:59     ` Brian Foster
  1 sibling, 1 reply; 9+ messages in thread
From: Alex Lyakas @ 2017-08-07 17:00 UTC (permalink / raw
  To: Brian Foster; +Cc: linux-xfs, Dave Chinner, Shyam Kaushik, Yair Hershko

Hello Brian,

Thanks for the review.

On Mon, Aug 7, 2017 at 6:16 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Mon, Aug 07, 2017 at 03:54:03PM +0300, Alex Lyakas wrote:
>> The new attribute leaf buffer is not held locked across the transaction roll
>> between the shortform->leaf modification and the addition of the new entry.
>> As a result, the attribute buffer modification being made is not atomic
>> from an operational perspective.
>> Hence the AIL push can grab it in the transient state of "just created"
>> after the initial transaction is rolled, because the buffer has been
>> released.
>> This leads to xfs_attr3_leaf_verify() asserting that hdr.count is zero,
>> treating this as in-memory corruption, and shutting down the filesystem.
>>
>> More info at:
>> https://www.spinics.net/lists/linux-xfs/msg08778.html
>>
>
> FYI.. I'm not sure how appropriate it is to use URLs in commit log
> descriptions. Perhaps somebody else can chime in on that.
I will get rid of it.

>
>> This patch is based on kernel 3.18.19, which is a long-term (although EOL)
>> kernel.
>> This is the reason it is marked as RFC.
>>
>
> It's probably best to develop against the latest kernel, get the fix
> right, then worry about v3.18 for a backport.
Unfortunately, for us this is exactly the opposite. We are running
kernel 3.18 in production, and that's where we need the fix. Moving to
a different kernel is a significant effort for us. We do it from time
to time, but it's always to a long-term stable kernel and not to one
of the latests. Still, below I provide a patch for 4.13-rc4.

>
>> Reproducing the issue is achieved by adding "msleep(1000)" after the
>> xfs_trans_roll() call.
>> From the limited testing, this patch seems to fix the issue.
>> Once/if the community approves this patch, we will do a formal testing.
>>
>> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
>> ---
>
> Note that your patch doesn't apply for me:
>
> ...
> patching file fs/xfs/libxfs/xfs_attr.c
> Hunk #1 FAILED at 285.
> patch: **** malformed patch at line 70: commits */
>
> Perhaps it has been damaged by your mailer? Otherwise, a few initial
> comments...
Trying a different mailer now.

>
>> fs/xfs/libxfs/xfs_attr.c      | 24 +++++++++++++++++++++---
>> fs/xfs/libxfs/xfs_attr_leaf.c |  5 ++++-
>> fs/xfs/libxfs/xfs_attr_leaf.h |  2 +-
>> 3 files changed, 26 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
>> index 353fb42..c0ced12 100644
>> --- a/fs/xfs/libxfs/xfs_attr.c
>> +++ b/fs/xfs/libxfs/xfs_attr.c
>> @@ -285,20 +285,21 @@ xfs_attr_set(
>>
>>     xfs_trans_ijoin(args.trans, dp, 0);
>>
>>     /*
>>      * If the attribute list is non-existent or a shortform list,
>>      * upgrade it to a single-leaf-block attribute list.
>>      */
>>     if (dp->i_d.di_aformat == XFS_DINODE_FMT_LOCAL ||
>>         (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS &&
>>          dp->i_d.di_anextents == 0)) {
>> +        xfs_buf_t *leaf_bp = NULL;
>
> Use struct xfs_buf here and throughout. We're attempting to remove uses
> of the old typedefs wherever possible.
Noted.

>
>>
>>         /*
>>          * Build initial attribute list (if required).
>>          */
>>         if (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS)
>>             xfs_attr_shortform_create(&args);
>>
>>         /*
>>          * Try to add the attr to the attribute list in
>>          * the inode.
>> @@ -328,48 +329,65 @@ xfs_attr_set(
>>             xfs_iunlock(dp, XFS_ILOCK_EXCL);
>>
>>             return error ? error : err2;
>>         }
>>
>>         /*
>>          * It won't fit in the shortform, transform to a leaf block.
>>          * GROT: another possible req'mt for a double-split btree op.
>>          */
>>         xfs_bmap_init(args.flist, args.firstblock);
>> -        error = xfs_attr_shortform_to_leaf(&args);
>> +        error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);
>> +        /* hold the leaf buffer locked, when "args.trans" transaction
>> commits */
>> +        if (leaf_bp)
>> +            xfs_trans_bhold(args.trans, leaf_bp);
>
> Indentation looks off here and throughout.
>
>> +
>>         if (!error) {
>>             error = xfs_bmap_finish(&args.trans, args.flist,
>>                         &committed);
>>         }
>>         if (error) {
>>             ASSERT(committed);
>> +            if (leaf_bp)
>> +                xfs_buf_relse(leaf_bp);
>
> Hmm, I don't think this is right. We don't want to release the buffer
> while the transaction still has a reference. Perhaps we should move this
> to the "out:" patch after the transaction cancel (and make sure the
> pointer is reset at the appropriate place)..?
I think I understand what you are saying. After we call
xfs_trans_bhold(), we need first the original transaction to commit or
to abort. Then we must either rejoin the buffer to a different
transaction or release it. I believe the below patch addresses this
issue.

>
>>             args.trans = NULL;
>>             xfs_bmap_cancel(&flist);
>>             goto out;
>>         }
>>
>>         /*
>>          * bmap_finish() may have committed the last trans and started
>>          * a new one.  We need the inode to be in all transactions.
>> +         * We also need to rejoin the leaf buffer to the new transaction
>> +         * and prevent it from being unlocked, before we commit it in
>> xfs_trans_roll().
>> +         * If bmap_finish() did not commit, then we are in the same
>> transaction,
>> +         * and no need to call xfs_trans_bhold() again.
>>          */
>> -        if (committed)
>> +        if (committed) {
>>             xfs_trans_ijoin(args.trans, dp, 0);
>> +            xfs_trans_bjoin(args.trans, leaf_bp);
>> +            xfs_trans_bhold(args.trans, leaf_bp);
>
> I don't see why this is necessary. We still have the buffer held and
> locked, yes?
I believe you are right. We don't care if there was another
transaction in-between. Addressed this.

>
>> +        }
>>
>>         /*
>>          * Commit the leaf transformation.  We'll need another (linked)
>>          * transaction to add the new attribute to the leaf.
>>          */
>>
>>         error = xfs_trans_roll(&args.trans, dp);
>> -        if (error)
>> +        if (error) {
>> +            xfs_buf_relse(leaf_bp);
>
> Same problem as above.
>
>>             goto out;
>> +        }
>>
>> +        /* rejoin the leaf buffer to the new transaction */
>> +        xfs_trans_bjoin(args.trans, leaf_bp);
>
> This comment should point out that this allows a subsequent read to find
> the buffer in the transaction (and avoid a deadlock).
Done.

>
>>     }
>>
>>     if (xfs_bmap_one_block(dp, XFS_ATTR_FORK))
>>         error = xfs_attr_leaf_addname(&args);
>>     else
>>         error = xfs_attr_node_addname(&args);
>>     if (error)
>>         goto out;
>>
>>     /*
>> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
>> index b7cd0a0..2e03c32 100644
>> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
>> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
>> @@ -649,23 +649,24 @@ xfs_attr_shortform_getvalue(xfs_da_args_t *args)
>>         args->valuelen = sfe->valuelen;
>>         memcpy(args->value, &sfe->nameval[args->namelen],
>>                             args->valuelen);
>>         return -EEXIST;
>>     }
>>     return -ENOATTR;
>> }
>>
>> /*
>>  * Convert from using the shortform to the leaf.
>> + * Upon success, return the leaf buffer.
>>  */
>> int
>> -xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
>> +xfs_attr_shortform_to_leaf(xfs_da_args_t *args, xfs_buf_t **bpp)
>> {
>>     xfs_inode_t *dp;
>>     xfs_attr_shortform_t *sf;
>>     xfs_attr_sf_entry_t *sfe;
>>     xfs_da_args_t nargs;
>>     char *tmpbuffer;
>>     int error, i, size;
>>     xfs_dablk_t blkno;
>>     struct xfs_buf *bp;
>>     xfs_ifork_t *ifp;
>> @@ -733,20 +734,22 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
>>         ASSERT(error == -ENOATTR);
>>         error = xfs_attr3_leaf_add(bp, &nargs);
>>         ASSERT(error != -ENOSPC);
>>         if (error)
>>             goto out;
>>         sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
>>     }
>>     error = 0;
>>
>> out:
>> +    if (error == 0)
>> +        *bpp = bp;
>
> It looks like the only way error == 0 is where it is set just above, so
> we could probably move this before the label.
I am a bit confused by this code in xfs_attr_shortform_to_leaf():
    ...
    error = xfs_attr3_leaf_create(args, blkno, &bp);
    if (error) {
        error = xfs_da_shrink_inode(args, 0, bp);
        bp = NULL;
        if (error)
            goto out;
        xfs_idata_realloc(dp, size, XFS_ATTR_FORK);    /* try to put */
        memcpy(ifp->if_u1.if_data, tmpbuffer, size);    /* it back */
        goto out;
    }
Here we fail to convert to a leaf. But if xfs_da_shrink_inode()
succeeds, we return error==0 back to the caller. But we actually
failed to convert. This, however, is not directly related to your
comment.

>
> I'm also wondering whether it might be cleaner to 1.) conditionally
> return the buffer when sf->hdr.count == 0 because that is the only case
> where the problem occurs and 2.) do the trans_bhold() here in
> shortform_to_leaf() when we do actually return it.
>
> Brian
>
>>     kmem_free(tmpbuffer);
>>     return error;
>> }
>>

Here is the patch based on kernel 4.13-rc4.

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index de7b9bd..982e322 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -218,6 +218,7 @@
     xfs_fsblock_t        firstblock;
     int            rsvd = (flags & ATTR_ROOT) != 0;
     int            error, err2, local;
+    struct xfs_buf        *leaf_bp = NULL;

     XFS_STATS_INC(mp, xs_attr_set);

@@ -327,7 +328,13 @@
          * GROT: another possible req'mt for a double-split btree op.
          */
         xfs_defer_init(args.dfops, args.firstblock);
-        error = xfs_attr_shortform_to_leaf(&args);
+        error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);
+        /*
+         * Prevent the leaf buffer from being unlocked
+         * when "args.trans" transaction commits.
+         */
+        if (leaf_bp)
+            xfs_trans_bhold(args.trans, leaf_bp);
         if (!error)
             error = xfs_defer_finish(&args.trans, args.dfops, dp);
         if (error) {
@@ -345,6 +352,14 @@
         if (error)
             goto out;

+        /*
+         * Rejoin the leaf buffer to the new transaction.
+         * This allows a subsequent read to find the buffer in the
+         * transaction (and avoid a deadlock).
+         */
+        xfs_trans_bjoin(args.trans, leaf_bp);
+        /* Prevent from being released at the end of the function */
+        leaf_bp = NULL;
     }

     if (xfs_bmap_one_block(dp, XFS_ATTR_FORK))
@@ -376,6 +391,8 @@
 out:
     if (args.trans)
         xfs_trans_cancel(args.trans);
+    if (leaf_bp)
+        xfs_buf_relse(leaf_bp);
     xfs_iunlock(dp, XFS_ILOCK_EXCL);
     return error;
 }
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index c6c15e5..ab73e4b 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -740,9 +740,10 @@ STATIC void xfs_attr3_leaf_moveents(struct
xfs_da_args *args,

 /*
  * Convert from using the shortform to the leaf.
+ * Upon success, return the leaf buffer.
  */
 int
-xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
+xfs_attr_shortform_to_leaf(xfs_da_args_t *args, struct xfs_buf **bpp)
 {
     xfs_inode_t *dp;
     xfs_attr_shortform_t *sf;
@@ -822,6 +823,7 @@ STATIC void xfs_attr3_leaf_moveents(struct
xfs_da_args *args,
         sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
     }
     error = 0;
+    *bpp = bp;

 out:
     kmem_free(tmpbuffer);
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index f7dda0c..7382d4e 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -48,7 +48,7 @@
 void    xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
 int    xfs_attr_shortform_lookup(struct xfs_da_args *args);
 int    xfs_attr_shortform_getvalue(struct xfs_da_args *args);
-int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
+int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args, struct
xfs_buf **bpp);
 int    xfs_attr_shortform_remove(struct xfs_da_args *args);
 int    xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
 int    xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
  2017-08-07 16:37   ` Darrick J. Wong
@ 2017-08-07 17:11     ` Alex Lyakas
  2017-08-07 17:26     ` Brian Foster
  1 sibling, 0 replies; 9+ messages in thread
From: Alex Lyakas @ 2017-08-07 17:11 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: Brian Foster, linux-xfs, Dave Chinner, Shyam Kaushik,
	Yair Hershko

Hi Darrick,


On Mon, Aug 7, 2017 at 7:37 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Mon, Aug 07, 2017 at 11:16:54AM -0400, Brian Foster wrote:
>> On Mon, Aug 07, 2017 at 03:54:03PM +0300, Alex Lyakas wrote:
>> > The new attribute leaf buffer is not held locked across the transaction roll
>> > between the shortform->leaf modification and the addition of the new entry.
>> > As a result, the attribute buffer modification being made is not atomic
>> > from an operational perspective.
>> > Hence the AIL push can grab it in the transient state of "just created"
>> > after the initial transaction is rolled, because the buffer has been
>> > released.
>> > This leads to xfs_attr3_leaf_verify() asserting that hdr.count is zero,
>> > treating this as in-memory corruption, and shutting down the filesystem.
>> >
>> > More info at:
>> > https://www.spinics.net/lists/linux-xfs/msg08778.html
>> >
>>
>> FYI.. I'm not sure how appropriate it is to use URLs in commit log
>> descriptions. Perhaps somebody else can chime in on that.
>
> I prefer the changelog directly quote the relevant parts of the message
> in case spinics should end up getting knocked offline like gmane, though
> you can certainly use the url to cite the quotation.
>
>> > This patch is based on kernel 3.18.19, which is a long-term (although EOL)
>> > kernel.
>> > This is the reason it is marked as RFC.
>> >
>>
>> It's probably best to develop against the latest kernel, get the fix
>> right, then worry about v3.18 for a backport.
>
> Yes, we've renamed a few things since then...
>
>> > Reproducing the issue is achieved by adding "msleep(1000)" after the
>> > xfs_trans_roll() call.
>> > From the limited testing, this patch seems to fix the issue.
>> > Once/if the community approves this patch, we will do a formal testing.
>
> Do you have a reproducer script that could be turned into an xfstest to
> prevent future regressions?  We could add an error injection site at
> that point that would force an ail push to simulate the problem.
>
I have a two-liner "script" that creates a file and sets an ACL with
length=100 bytes on it using setfattr. I run several of these in
multiple processes. But the problem reproduces only if I add
msleep(1000), after the call to xfs_trans_roll() before adding the
first attribute to the leaf. Then the problem reproduces easily.


>> > Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
>> > ---
>>
>> Note that your patch doesn't apply for me:
>>
>> ...
>> patching file fs/xfs/libxfs/xfs_attr.c
>> Hunk #1 FAILED at 285.
>> patch: **** malformed patch at line 70: commits */
>>
>> Perhaps it has been damaged by your mailer? Otherwise, a few initial
>> comments...
>
> Yes, I also see severe whitespace damage.
>
>> > fs/xfs/libxfs/xfs_attr.c      | 24 +++++++++++++++++++++---
>> > fs/xfs/libxfs/xfs_attr_leaf.c |  5 ++++-
>> > fs/xfs/libxfs/xfs_attr_leaf.h |  2 +-
>> > 3 files changed, 26 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
>> > index 353fb42..c0ced12 100644
>> > --- a/fs/xfs/libxfs/xfs_attr.c
>> > +++ b/fs/xfs/libxfs/xfs_attr.c
>> > @@ -285,20 +285,21 @@ xfs_attr_set(
>> >
>> >     xfs_trans_ijoin(args.trans, dp, 0);
>> >
>> >     /*
>> >      * If the attribute list is non-existent or a shortform list,
>> >      * upgrade it to a single-leaf-block attribute list.
>> >      */
>> >     if (dp->i_d.di_aformat == XFS_DINODE_FMT_LOCAL ||
>> >         (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS &&
>> >          dp->i_d.di_anextents == 0)) {
>> > +        xfs_buf_t *leaf_bp = NULL;
>>
>> Use struct xfs_buf here and throughout. We're attempting to remove uses
>> of the old typedefs wherever possible.
>>
>> >
>> >         /*
>> >          * Build initial attribute list (if required).
>> >          */
>> >         if (dp->i_d.di_aformat == XFS_DINODE_FMT_EXTENTS)
>> >             xfs_attr_shortform_create(&args);
>> >
>> >         /*
>> >          * Try to add the attr to the attribute list in
>> >          * the inode.
>> > @@ -328,48 +329,65 @@ xfs_attr_set(
>> >             xfs_iunlock(dp, XFS_ILOCK_EXCL);
>> >
>> >             return error ? error : err2;
>> >         }
>> >
>> >         /*
>> >          * It won't fit in the shortform, transform to a leaf block.
>> >          * GROT: another possible req'mt for a double-split btree op.
>> >          */
>> >         xfs_bmap_init(args.flist, args.firstblock);
>> > -        error = xfs_attr_shortform_to_leaf(&args);
>> > +        error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);
>> > +        /* hold the leaf buffer locked, when "args.trans" transaction
>> > commits */
>> > +        if (leaf_bp)
>> > +            xfs_trans_bhold(args.trans, leaf_bp);
>>
>> Indentation looks off here and throughout.
>>
>> > +
>> >         if (!error) {
>> >             error = xfs_bmap_finish(&args.trans, args.flist,
>> >                         &committed);
>
> (FYI this is now xfs_defer_finish...)
>
>> >         }
>> >         if (error) {
>> >             ASSERT(committed);
>> > +            if (leaf_bp)
>> > +                xfs_buf_relse(leaf_bp);
>>
>> Hmm, I don't think this is right. We don't want to release the buffer
>> while the transaction still has a reference. Perhaps we should move this
>> to the "out:" patch after the transaction cancel (and make sure the
>> pointer is reset at the appropriate place)..?
>
> I would've thought that the buffer would be relse'd by xfs_trans_cancel,
> having been bjoined to the first args.trans and bheld to any subsequent
> transactions?
According to Brian's comment, and also reviewing the code, after
holding the buffer, we need first to let the transaction commit or
abort. Then we know that it is not touching the buffer anymore. At
this point, we must either re-join the buffer to a different
transaction, or explicitly release it. That's my understanding:)

>
>>
>> >             args.trans = NULL;
>> >             xfs_bmap_cancel(&flist);
>> >             goto out;
>> >         }
>> >
>> >         /*
>> >          * bmap_finish() may have committed the last trans and started
>> >          * a new one.  We need the inode to be in all transactions.
>> > +         * We also need to rejoin the leaf buffer to the new transaction
>> > +         * and prevent it from being unlocked, before we commit it in
>> > xfs_trans_roll().
>> > +         * If bmap_finish() did not commit, then we are in the same
>> > transaction,
>> > +         * and no need to call xfs_trans_bhold() again.
>> >          */
>> > -        if (committed)
>> > +        if (committed) {
>> >             xfs_trans_ijoin(args.trans, dp, 0);
>> > +            xfs_trans_bjoin(args.trans, leaf_bp);
>> > +            xfs_trans_bhold(args.trans, leaf_bp);
>>
>> I don't see why this is necessary. We still have the buffer held and
>> locked, yes?
>
> The bhold means that after the transaction roll leaf_bp is still locked
> but not associated with a transaction, so it is necessary to bjoin
> leaf_bp to the new transaction.  Since the very next thing we do is roll
> the transaction again, we also want to bhold it for the second roll.
>
I think that Brian is right, and we don't care if there was another
transaction rolled in-between. We can, of course, re-join it and bhold
it on an intermediate transaction (if any). This doesn't harm, but
seems unnecessary. My understanding again:)

Thanks,
Alex.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
  2017-08-07 16:37   ` Darrick J. Wong
  2017-08-07 17:11     ` Alex Lyakas
@ 2017-08-07 17:26     ` Brian Foster
  2017-08-07 18:10       ` Darrick J. Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Brian Foster @ 2017-08-07 17:26 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: Alex Lyakas, linux-xfs, david, Shyam Kaushik, Yair Hershko

On Mon, Aug 07, 2017 at 09:37:44AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 07, 2017 at 11:16:54AM -0400, Brian Foster wrote:
> > On Mon, Aug 07, 2017 at 03:54:03PM +0300, Alex Lyakas wrote:
> > > The new attribute leaf buffer is not held locked across the transaction roll
> > > between the shortform->leaf modification and the addition of the new entry.
> > > As a result, the attribute buffer modification being made is not atomic
> > > from an operational perspective.
> > > Hence the AIL push can grab it in the transient state of "just created"
> > > after the initial transaction is rolled, because the buffer has been
> > > released.
> > > This leads to xfs_attr3_leaf_verify() asserting that hdr.count is zero,
> > > treating this as in-memory corruption, and shutting down the filesystem.
> > > 
> > > More info at:
> > > https://www.spinics.net/lists/linux-xfs/msg08778.html
> > > 
...
> 
> > > fs/xfs/libxfs/xfs_attr.c      | 24 +++++++++++++++++++++---
> > > fs/xfs/libxfs/xfs_attr_leaf.c |  5 ++++-
> > > fs/xfs/libxfs/xfs_attr_leaf.h |  2 +-
> > > 3 files changed, 26 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> > > index 353fb42..c0ced12 100644
> > > --- a/fs/xfs/libxfs/xfs_attr.c
> > > +++ b/fs/xfs/libxfs/xfs_attr.c
...
> > > @@ -328,48 +329,65 @@ xfs_attr_set(
...
> > >         if (error) {
> > >             ASSERT(committed);
> > > +            if (leaf_bp)
> > > +                xfs_buf_relse(leaf_bp);
> > 
> > Hmm, I don't think this is right. We don't want to release the buffer
> > while the transaction still has a reference. Perhaps we should move this
> > to the "out:" patch after the transaction cancel (and make sure the
> > pointer is reset at the appropriate place)..?
> 
> I would've thought that the buffer would be relse'd by xfs_trans_cancel,
> having been bjoined to the first args.trans and bheld to any subsequent
> transactions?
> 

I'm not following what you mean. I guess if the transaction has been
committed by xfs_bmap_finish(), then the buffer is actually not in the
current transaction, but I think we need to release it either way (so
perhaps this code is fine).

> > 
> > >             args.trans = NULL;
> > >             xfs_bmap_cancel(&flist);
> > >             goto out;
> > >         }
> > > 
> > >         /*
> > >          * bmap_finish() may have committed the last trans and started
> > >          * a new one.  We need the inode to be in all transactions.
> > > +         * We also need to rejoin the leaf buffer to the new transaction
> > > +         * and prevent it from being unlocked, before we commit it in
> > > xfs_trans_roll().
> > > +         * If bmap_finish() did not commit, then we are in the same
> > > transaction,
> > > +         * and no need to call xfs_trans_bhold() again.
> > >          */
> > > -        if (committed)
> > > +        if (committed) {
> > >             xfs_trans_ijoin(args.trans, dp, 0);
> > > +            xfs_trans_bjoin(args.trans, leaf_bp);
> > > +            xfs_trans_bhold(args.trans, leaf_bp);
> > 
> > I don't see why this is necessary. We still have the buffer held and
> > locked, yes?
> 
> The bhold means that after the transaction roll leaf_bp is still locked
> but not associated with a transaction, so it is necessary to bjoin
> leaf_bp to the new transaction.  Since the very next thing we do is roll
> the transaction again, we also want to bhold it for the second roll.
> 

Ok, so we've created the empty block, logged it in the first
transaction, marked it held and committed the tx. At that point the
buffer can make it to the AIL, but we still have it locked (and held) so
it cannot be written back. What difference does it make whether the
buffer is added to this transaction? It won't be dirtied in the
transaction, it can't be written back either way and we eventually join
it to the transaction that creates the attr (after the roll).

> (Question: is the second roll even necessary for this case?)
> 

It's probably moot because I believe this code goes away in recent
kernels as the committed thing was cleaned out, probably before we
converted this over to xfs_defer_finish().

...
> > > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> > > index b7cd0a0..2e03c32 100644
> > > --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> > > +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
...
> > > @@ -733,20 +734,22 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
> > >         ASSERT(error == -ENOATTR);
> > >         error = xfs_attr3_leaf_add(bp, &nargs);
> > >         ASSERT(error != -ENOSPC);
> > >         if (error)
> > >             goto out;
> > >         sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
> > >     }
> > >     error = 0;
> > > 
> > > out:
> > > +    if (error == 0)
> > > +        *bpp = bp;
> > 
> > It looks like the only way error == 0 is where it is set just above, so
> > we could probably move this before the label.
> > 
> > I'm also wondering whether it might be cleaner to 1.) conditionally
> > return the buffer when sf->hdr.count == 0 because that is the only case
> > where the problem occurs and 2.) do the trans_bhold() here in
> > shortform_to_leaf() when we do actually return it.
> 
> I /think/ you could get rid of that second roll and do the bhold in
> shortform_to_leaf, though you'd still have to re-bjoin it after
> defer_finish.
> 

Yeah, the semantic is basically to return a buffer iff the callee held
it in the tx because it was empty (maybe reneame leaf_bp to empty_bp or
something more clear?), thereby making the caller responsible to release
it. The caller happens to do that by joining it to the transaction that
will add the attr.

Brian

> --D
> 
> > 
> > Brian
> > 
> > >     kmem_free(tmpbuffer);
> > >     return error;
> > > }
> > > 
> > > /*
> > >  * Check a leaf attribute block to see if all the entries would fit into
> > >  * a shortform attribute list.
> > >  */
> > > int
> > > xfs_attr_shortform_allfit(
> > > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> > > index 4f3a60a..d58e9e4 100644
> > > --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> > > +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> > > @@ -41,21 +41,21 @@ typedef struct xfs_attr_inactive_list {
> > >  * Function prototypes for the kernel.
> > >  *========================================================================*/
> > > 
> > > /*
> > >  * Internal routines when attribute fork size < XFS_LITINO(mp).
> > >  */
> > > void    xfs_attr_shortform_create(struct xfs_da_args *args);
> > > void    xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
> > > int    xfs_attr_shortform_lookup(struct xfs_da_args *args);
> > > int    xfs_attr_shortform_getvalue(struct xfs_da_args *args);
> > > -int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
> > > +int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args, xfs_buf_t
> > > **bpp);
> > > int    xfs_attr_shortform_remove(struct xfs_da_args *args);
> > > int    xfs_attr_shortform_list(struct xfs_attr_list_context *context);
> > > int    xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
> > > int    xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes);
> > > void    xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
> > > 
> > > /*
> > >  * Internal routines when attribute fork size == XFS_LBSIZE(mp).
> > >  */
> > > int    xfs_attr3_leaf_to_node(struct xfs_da_args *args);
> > > -- 
> > > 1.9.1
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
  2017-08-07 17:00   ` Alex Lyakas
@ 2017-08-07 17:59     ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2017-08-07 17:59 UTC (permalink / raw
  To: Alex Lyakas; +Cc: linux-xfs, Dave Chinner, Shyam Kaushik, Yair Hershko

On Mon, Aug 07, 2017 at 08:00:46PM +0300, Alex Lyakas wrote:
> Hello Brian,
> 
> Thanks for the review.
> 
> On Mon, Aug 7, 2017 at 6:16 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Mon, Aug 07, 2017 at 03:54:03PM +0300, Alex Lyakas wrote:
> >> The new attribute leaf buffer is not held locked across the transaction roll
> >> between the shortform->leaf modification and the addition of the new entry.
> >> As a result, the attribute buffer modification being made is not atomic
> >> from an operational perspective.
> >> Hence the AIL push can grab it in the transient state of "just created"
> >> after the initial transaction is rolled, because the buffer has been
> >> released.
> >> This leads to xfs_attr3_leaf_verify() asserting that hdr.count is zero,
> >> treating this as in-memory corruption, and shutting down the filesystem.
> >>
> >> More info at:
> >> https://www.spinics.net/lists/linux-xfs/msg08778.html
> >>
> >
> > FYI.. I'm not sure how appropriate it is to use URLs in commit log
> > descriptions. Perhaps somebody else can chime in on that.
> I will get rid of it.
> 
> >
> >> This patch is based on kernel 3.18.19, which is a long-term (although EOL)
> >> kernel.
> >> This is the reason it is marked as RFC.
> >>
> >
> > It's probably best to develop against the latest kernel, get the fix
> > right, then worry about v3.18 for a backport.
> Unfortunately, for us this is exactly the opposite. We are running
> kernel 3.18 in production, and that's where we need the fix. Moving to
> a different kernel is a significant effort for us. We do it from time
> to time, but it's always to a long-term stable kernel and not to one
> of the latests. Still, below I provide a patch for 4.13-rc4.
> 

This is the same as for most people. Nobody is running the latest
for-next kernel in a critical production environment. We debug/diagnose
issues against older kernels all the time, fix in the latest development
tree and backport from there as needed. Once the fix is backported and
picked up in a stable kernel, you can update your production systems at
that point.

It sounds like you have a reliable reproducer (albeit with injection of
an artificial delay), so you should be able to reproduce and test/verify
against a development kernel. This doesn't require moving any of your
infrastructure to the latest kernel (use a vm, for example).

> >
> >> Reproducing the issue is achieved by adding "msleep(1000)" after the
> >> xfs_trans_roll() call.
> >> From the limited testing, this patch seems to fix the issue.
> >> Once/if the community approves this patch, we will do a formal testing.
> >>
> >> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> >> ---
> >
> > Note that your patch doesn't apply for me:
> >
> > ...
> > patching file fs/xfs/libxfs/xfs_attr.c
> > Hunk #1 FAILED at 285.
> > patch: **** malformed patch at line 70: commits */
> >
> > Perhaps it has been damaged by your mailer? Otherwise, a few initial
> > comments...
> Trying a different mailer now.
> 

Same problem. It looks like everything is converted to spaces. Are you
copy+pasting the patch into an email? Note that usually will not work.
You can use something like 'git send-email' to post correctly formatted
patches over mail. Also note you can send it to yourself initially and
test apply it (and/or run scripts/checkpatch.pl against it) to make sure
it works.

> >
> >> fs/xfs/libxfs/xfs_attr.c      | 24 +++++++++++++++++++++---
> >> fs/xfs/libxfs/xfs_attr_leaf.c |  5 ++++-
> >> fs/xfs/libxfs/xfs_attr_leaf.h |  2 +-
> >> 3 files changed, 26 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> >> index 353fb42..c0ced12 100644
> >> --- a/fs/xfs/libxfs/xfs_attr.c
> >> +++ b/fs/xfs/libxfs/xfs_attr.c
...
> >> @@ -328,48 +329,65 @@ xfs_attr_set(
...
> > Hmm, I don't think this is right. We don't want to release the buffer
> > while the transaction still has a reference. Perhaps we should move this
> > to the "out:" patch after the transaction cancel (and make sure the
> > pointer is reset at the appropriate place)..?
> I think I understand what you are saying. After we call
> xfs_trans_bhold(), we need first the original transaction to commit or
> to abort. Then we must either rejoin the buffer to a different
> transaction or release it. I believe the below patch addresses this
> issue.
> 

As noted in the last mail, it may not matter since the tx is committed
or aborted by the time an error is returned here. That said, I still
prefer the code you have below. :)

...
> >> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> >> index b7cd0a0..2e03c32 100644
> >> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> >> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> >> @@ -649,23 +649,24 @@ xfs_attr_shortform_getvalue(xfs_da_args_t *args)
> >>         args->valuelen = sfe->valuelen;
> >>         memcpy(args->value, &sfe->nameval[args->namelen],
> >>                             args->valuelen);
> >>         return -EEXIST;
> >>     }
> >>     return -ENOATTR;
> >> }
> >>
> >> /*
> >>  * Convert from using the shortform to the leaf.
> >> + * Upon success, return the leaf buffer.
> >>  */
> >> int
> >> -xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
> >> +xfs_attr_shortform_to_leaf(xfs_da_args_t *args, xfs_buf_t **bpp)
> >> {
> >>     xfs_inode_t *dp;
> >>     xfs_attr_shortform_t *sf;
> >>     xfs_attr_sf_entry_t *sfe;
> >>     xfs_da_args_t nargs;
> >>     char *tmpbuffer;
> >>     int error, i, size;
> >>     xfs_dablk_t blkno;
> >>     struct xfs_buf *bp;
> >>     xfs_ifork_t *ifp;
> >> @@ -733,20 +734,22 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
> >>         ASSERT(error == -ENOATTR);
> >>         error = xfs_attr3_leaf_add(bp, &nargs);
> >>         ASSERT(error != -ENOSPC);
> >>         if (error)
> >>             goto out;
> >>         sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
> >>     }
> >>     error = 0;
> >>
> >> out:
> >> +    if (error == 0)
> >> +        *bpp = bp;
> >
> > It looks like the only way error == 0 is where it is set just above, so
> > we could probably move this before the label.
> I am a bit confused by this code in xfs_attr_shortform_to_leaf():
>     ...
>     error = xfs_attr3_leaf_create(args, blkno, &bp);
>     if (error) {
>         error = xfs_da_shrink_inode(args, 0, bp);
>         bp = NULL;
>         if (error)
>             goto out;
>         xfs_idata_realloc(dp, size, XFS_ATTR_FORK);    /* try to put */
>         memcpy(ifp->if_u1.if_data, tmpbuffer, size);    /* it back */
>         goto out;
>     }
> Here we fail to convert to a leaf. But if xfs_da_shrink_inode()
> succeeds, we return error==0 back to the caller. But we actually
> failed to convert. This, however, is not directly related to your
> comment.
> 

Hm, I'm not aware of any reason to intentionally clear the error in that
case. That may be a bug.

Brian

> >
> > I'm also wondering whether it might be cleaner to 1.) conditionally
> > return the buffer when sf->hdr.count == 0 because that is the only case
> > where the problem occurs and 2.) do the trans_bhold() here in
> > shortform_to_leaf() when we do actually return it.
> >
> > Brian
> >
> >>     kmem_free(tmpbuffer);
> >>     return error;
> >> }
> >>
> 
> Here is the patch based on kernel 4.13-rc4.
> 
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index de7b9bd..982e322 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -218,6 +218,7 @@
>      xfs_fsblock_t        firstblock;
>      int            rsvd = (flags & ATTR_ROOT) != 0;
>      int            error, err2, local;
> +    struct xfs_buf        *leaf_bp = NULL;
> 
>      XFS_STATS_INC(mp, xs_attr_set);
> 
> @@ -327,7 +328,13 @@
>           * GROT: another possible req'mt for a double-split btree op.
>           */
>          xfs_defer_init(args.dfops, args.firstblock);
> -        error = xfs_attr_shortform_to_leaf(&args);
> +        error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);
> +        /*
> +         * Prevent the leaf buffer from being unlocked
> +         * when "args.trans" transaction commits.
> +         */
> +        if (leaf_bp)
> +            xfs_trans_bhold(args.trans, leaf_bp);
>          if (!error)
>              error = xfs_defer_finish(&args.trans, args.dfops, dp);
>          if (error) {
> @@ -345,6 +352,14 @@
>          if (error)
>              goto out;
> 
> +        /*
> +         * Rejoin the leaf buffer to the new transaction.
> +         * This allows a subsequent read to find the buffer in the
> +         * transaction (and avoid a deadlock).
> +         */
> +        xfs_trans_bjoin(args.trans, leaf_bp);
> +        /* Prevent from being released at the end of the function */
> +        leaf_bp = NULL;
>      }
> 
>      if (xfs_bmap_one_block(dp, XFS_ATTR_FORK))
> @@ -376,6 +391,8 @@
>  out:
>      if (args.trans)
>          xfs_trans_cancel(args.trans);
> +    if (leaf_bp)
> +        xfs_buf_relse(leaf_bp);
>      xfs_iunlock(dp, XFS_ILOCK_EXCL);
>      return error;
>  }
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index c6c15e5..ab73e4b 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -740,9 +740,10 @@ STATIC void xfs_attr3_leaf_moveents(struct
> xfs_da_args *args,
> 
>  /*
>   * Convert from using the shortform to the leaf.
> + * Upon success, return the leaf buffer.
>   */
>  int
> -xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
> +xfs_attr_shortform_to_leaf(xfs_da_args_t *args, struct xfs_buf **bpp)
>  {
>      xfs_inode_t *dp;
>      xfs_attr_shortform_t *sf;
> @@ -822,6 +823,7 @@ STATIC void xfs_attr3_leaf_moveents(struct
> xfs_da_args *args,
>          sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
>      }
>      error = 0;
> +    *bpp = bp;
> 
>  out:
>      kmem_free(tmpbuffer);
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> index f7dda0c..7382d4e 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> @@ -48,7 +48,7 @@
>  void    xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
>  int    xfs_attr_shortform_lookup(struct xfs_da_args *args);
>  int    xfs_attr_shortform_getvalue(struct xfs_da_args *args);
> -int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
> +int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args, struct
> xfs_buf **bpp);
>  int    xfs_attr_shortform_remove(struct xfs_da_args *args);
>  int    xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
>  int    xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
  2017-08-07 17:26     ` Brian Foster
@ 2017-08-07 18:10       ` Darrick J. Wong
  2017-08-08  7:51         ` Alex Lyakas
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2017-08-07 18:10 UTC (permalink / raw
  To: Brian Foster; +Cc: Alex Lyakas, linux-xfs, david, Shyam Kaushik, Yair Hershko

On Mon, Aug 07, 2017 at 01:26:09PM -0400, Brian Foster wrote:
> On Mon, Aug 07, 2017 at 09:37:44AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 07, 2017 at 11:16:54AM -0400, Brian Foster wrote:
> > > On Mon, Aug 07, 2017 at 03:54:03PM +0300, Alex Lyakas wrote:
> > > > The new attribute leaf buffer is not held locked across the transaction roll
> > > > between the shortform->leaf modification and the addition of the new entry.
> > > > As a result, the attribute buffer modification being made is not atomic
> > > > from an operational perspective.
> > > > Hence the AIL push can grab it in the transient state of "just created"
> > > > after the initial transaction is rolled, because the buffer has been
> > > > released.
> > > > This leads to xfs_attr3_leaf_verify() asserting that hdr.count is zero,
> > > > treating this as in-memory corruption, and shutting down the filesystem.
> > > > 
> > > > More info at:
> > > > https://www.spinics.net/lists/linux-xfs/msg08778.html
> > > > 
> ...
> > 
> > > > fs/xfs/libxfs/xfs_attr.c      | 24 +++++++++++++++++++++---
> > > > fs/xfs/libxfs/xfs_attr_leaf.c |  5 ++++-
> > > > fs/xfs/libxfs/xfs_attr_leaf.h |  2 +-
> > > > 3 files changed, 26 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> > > > index 353fb42..c0ced12 100644
> > > > --- a/fs/xfs/libxfs/xfs_attr.c
> > > > +++ b/fs/xfs/libxfs/xfs_attr.c
> ...
> > > > @@ -328,48 +329,65 @@ xfs_attr_set(
> ...
> > > >         if (error) {
> > > >             ASSERT(committed);
> > > > +            if (leaf_bp)
> > > > +                xfs_buf_relse(leaf_bp);
> > > 
> > > Hmm, I don't think this is right. We don't want to release the buffer
> > > while the transaction still has a reference. Perhaps we should move this
> > > to the "out:" patch after the transaction cancel (and make sure the
> > > pointer is reset at the appropriate place)..?
> > 
> > I would've thought that the buffer would be relse'd by xfs_trans_cancel,
> > having been bjoined to the first args.trans and bheld to any subsequent
> > transactions?

(Ugh, I can't follow myself either. Sorry for the confusing question.)

> I'm not following what you mean. I guess if the transaction has been
> committed by xfs_bmap_finish(), then the buffer is actually not in the
> current transaction, but I think we need to release it either way (so
> perhaps this code is fine).

I'm confused and going to step through through this code one piece at a time.

	/*
	 * It won't fit in the shortform, transform to a leaf block.
	 * GROT: another possible req'mt for a double-split btree op.
	 */
	xfs_defer_init(args.dfops, args.firstblock);
	error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);

	/* hold the leaf buffer locked, when "args.trans" transaction commits */
	if (leaf_bp)
		xfs_trans_bhold(args.trans, leaf_bp);

If we hit an error in xfs_attr_shortform_to_leaf leaf_bp will have been
bjoined to args.trans by xfs_attr_shortform_to_leaf ->
xfs_attr3_leaf_create -> xfs_da_get_buf -> xfs_trans_get_buf_map, so
when we goto out in the error handling clause below, we'll trans_cancel,
which releases leaf_bp, right?  And we can't just xfs_buf_relse because
that would unlock a buffer that's still owned by args.trans (like Brian
said)...

	if (!error)
		error = xfs_defer_finish(&args.trans, args.dfops, dp);

...but if we hit an error here, we'll have committed the transaction but
held on to the locked buffer, so we /do/ need to xfs_buf_relse...

	if (error) {
		args.trans = NULL;
		xfs_defer_cancel(&dfops);
		goto out;
	}

...in which case we no longer can use this error handling clause for
both cases because we have differing ownership of leaf_bp.  In effect we
need something structured like this, I think?

	xfs_defer_init()
	error = xfs_attr_shortform_to_leaf(...);
	if (error) {
		xfs_defer_cancel(...);
		goto out;
	}
	if (leaf_bp)
		xfs_trans_bhold(args.trans, leaf_bp);

	error = xfs_defer_finish(...);
	if (error) {
		xfs_defer_cancel(...);
		if (leaf_bp)
			xfs_brelse(leaf_bp);
		goto out;
	}

...and then when we're ready to add the new attr:

	if (leaf_bp)
		xfs_trans_bjoin(args.trans, leaf_bp);
	error = xfs_attr_*_addname(...);


> 
> > > 
> > > >             args.trans = NULL;
> > > >             xfs_bmap_cancel(&flist);
> > > >             goto out;
> > > >         }
> > > > 
> > > >         /*
> > > >          * bmap_finish() may have committed the last trans and started
> > > >          * a new one.  We need the inode to be in all transactions.
> > > > +         * We also need to rejoin the leaf buffer to the new transaction
> > > > +         * and prevent it from being unlocked, before we commit it in
> > > > xfs_trans_roll().
> > > > +         * If bmap_finish() did not commit, then we are in the same
> > > > transaction,
> > > > +         * and no need to call xfs_trans_bhold() again.
> > > >          */
> > > > -        if (committed)
> > > > +        if (committed) {
> > > >             xfs_trans_ijoin(args.trans, dp, 0);
> > > > +            xfs_trans_bjoin(args.trans, leaf_bp);
> > > > +            xfs_trans_bhold(args.trans, leaf_bp);
> > > 
> > > I don't see why this is necessary. We still have the buffer held and
> > > locked, yes?
> > 
> > The bhold means that after the transaction roll leaf_bp is still locked
> > but not associated with a transaction, so it is necessary to bjoin
> > leaf_bp to the new transaction.  Since the very next thing we do is roll
> > the transaction again, we also want to bhold it for the second roll.
> > 
> 
> Ok, so we've created the empty block, logged it in the first
> transaction, marked it held and committed the tx. At that point the
> buffer can make it to the AIL, but we still have it locked (and held) so
> it cannot be written back. What difference does it make whether the
> buffer is added to this transaction? It won't be dirtied in the
> transaction, it can't be written back either way and we eventually join
> it to the transaction that creates the attr (after the roll).

Yes, you're right, leaf_bp is held by us and not the transaction, and we
don't need to touch leaf_bp for this empty roll, so we neither need to
bjoin it nor bhold it here.

> > (Question: is the second roll even necessary for this case?)
> > 
> 
> It's probably moot because I believe this code goes away in recent
> kernels as the committed thing was cleaned out, probably before we
> converted this over to xfs_defer_finish().

It's still there, though it's probably moot and benign.

> ...
> > > > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> > > > index b7cd0a0..2e03c32 100644
> > > > --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> > > > +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> ...
> > > > @@ -733,20 +734,22 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
> > > >         ASSERT(error == -ENOATTR);
> > > >         error = xfs_attr3_leaf_add(bp, &nargs);
> > > >         ASSERT(error != -ENOSPC);
> > > >         if (error)
> > > >             goto out;
> > > >         sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
> > > >     }
> > > >     error = 0;
> > > > 
> > > > out:
> > > > +    if (error == 0)
> > > > +        *bpp = bp;
> > > 
> > > It looks like the only way error == 0 is where it is set just above, so
> > > we could probably move this before the label.
> > > 
> > > I'm also wondering whether it might be cleaner to 1.) conditionally
> > > return the buffer when sf->hdr.count == 0 because that is the only case
> > > where the problem occurs and 2.) do the trans_bhold() here in
> > > shortform_to_leaf() when we do actually return it.
> > 
> > I /think/ you could get rid of that second roll and do the bhold in
> > shortform_to_leaf, though you'd still have to re-bjoin it after
> > defer_finish.
> > 
> 
> Yeah, the semantic is basically to return a buffer iff the callee held
> it in the tx because it was empty (maybe reneame leaf_bp to empty_bp or
> something more clear?), thereby making the caller responsible to release
> it. The caller happens to do that by joining it to the transaction that
> will add the attr.

<nod>

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >     kmem_free(tmpbuffer);
> > > >     return error;
> > > > }
> > > > 
> > > > /*
> > > >  * Check a leaf attribute block to see if all the entries would fit into
> > > >  * a shortform attribute list.
> > > >  */
> > > > int
> > > > xfs_attr_shortform_allfit(
> > > > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> > > > index 4f3a60a..d58e9e4 100644
> > > > --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> > > > +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> > > > @@ -41,21 +41,21 @@ typedef struct xfs_attr_inactive_list {
> > > >  * Function prototypes for the kernel.
> > > >  *========================================================================*/
> > > > 
> > > > /*
> > > >  * Internal routines when attribute fork size < XFS_LITINO(mp).
> > > >  */
> > > > void    xfs_attr_shortform_create(struct xfs_da_args *args);
> > > > void    xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
> > > > int    xfs_attr_shortform_lookup(struct xfs_da_args *args);
> > > > int    xfs_attr_shortform_getvalue(struct xfs_da_args *args);
> > > > -int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
> > > > +int    xfs_attr_shortform_to_leaf(struct xfs_da_args *args, xfs_buf_t
> > > > **bpp);
> > > > int    xfs_attr_shortform_remove(struct xfs_da_args *args);
> > > > int    xfs_attr_shortform_list(struct xfs_attr_list_context *context);
> > > > int    xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
> > > > int    xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes);
> > > > void    xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
> > > > 
> > > > /*
> > > >  * Internal routines when attribute fork size == XFS_LBSIZE(mp).
> > > >  */
> > > > int    xfs_attr3_leaf_to_node(struct xfs_da_args *args);
> > > > -- 
> > > > 1.9.1
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute
  2017-08-07 18:10       ` Darrick J. Wong
@ 2017-08-08  7:51         ` Alex Lyakas
  0 siblings, 0 replies; 9+ messages in thread
From: Alex Lyakas @ 2017-08-08  7:51 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: Brian Foster, linux-xfs, Dave Chinner, Shyam Kaushik,
	Yair Hershko

Hi Darrick,

[...]

> I'm confused and going to step through through this code one piece at a time.
>
>         /*
>          * It won't fit in the shortform, transform to a leaf block.
>          * GROT: another possible req'mt for a double-split btree op.
>          */
>         xfs_defer_init(args.dfops, args.firstblock);
>         error = xfs_attr_shortform_to_leaf(&args, &leaf_bp);
>
>         /* hold the leaf buffer locked, when "args.trans" transaction commits */
>         if (leaf_bp)
>                 xfs_trans_bhold(args.trans, leaf_bp);
>
> If we hit an error in xfs_attr_shortform_to_leaf leaf_bp will have been
> bjoined to args.trans by xfs_attr_shortform_to_leaf ->
> xfs_attr3_leaf_create -> xfs_da_get_buf -> xfs_trans_get_buf_map, so
> when we goto out in the error handling clause below, we'll trans_cancel,
> which releases leaf_bp, right?  And we can't just xfs_buf_relse because
> that would unlock a buffer that's still owned by args.trans (like Brian
> said)...
If xfs_attr_shortform_to_leaf() hits an error, it will not set the
caller's leaf_bp pointer. So this pointer will remain NULL, and
therefore all the error handing logic at the caller will not touch
this xfs_buf. I believe this addresses your concern.

Thanks,
Alex.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-08-08  7:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-07 12:54 [RFC - PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of the first attribute Alex Lyakas
2017-08-07 15:16 ` Brian Foster
2017-08-07 16:37   ` Darrick J. Wong
2017-08-07 17:11     ` Alex Lyakas
2017-08-07 17:26     ` Brian Foster
2017-08-07 18:10       ` Darrick J. Wong
2017-08-08  7:51         ` Alex Lyakas
2017-08-07 17:00   ` Alex Lyakas
2017-08-07 17:59     ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.