All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] xfs: various fixes
@ 2015-07-21  1:09 Dave Chinner
  2015-07-21  1:09 ` [PATCH 1/4] xfs: call dax_fault on read page faults for DAX Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Dave Chinner @ 2015-07-21  1:09 UTC (permalink / raw
  To: xfs; +Cc: willy

Hi folks,

The first 3 patches fix various issues that need to head to Linus
before 4.2 is released. Patch 1 is a bug in the DAX support
resulting froma botched merge on my part - DAX doesn't do the direct
access part of DAX without this fix. (Willy, you only need to look
at this patch. :)

Patch 2 and 3 are stable kernel candidates - if nobody objects I'll
add stable cc's to them - as they affect recovery behaviour and
thought they are hard to trigger they can result in silent on-disk
corruption occurring if they do.

Patch 4 is just a cleanup that I noticed when looking at other code.

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] xfs: call dax_fault on read page faults for DAX
  2015-07-21  1:09 [PATCH 0/4] xfs: various fixes Dave Chinner
@ 2015-07-21  1:09 ` Dave Chinner
  2015-07-21 11:50   ` Brian Foster
  2015-07-21 13:57   ` Matthew Wilcox
  2015-07-21  1:09 ` [PATCH 2/4] xfs: remote attribute headers contain an invalid LSN Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2015-07-21  1:09 UTC (permalink / raw
  To: xfs; +Cc: willy

From: Dave Chinner <dchinner@redhat.com>

When modifying the patch series to handle the XFS MMAP_LOCK nesting
of page faults, I botched the conversion of the read page fault
path, and so it is only every calling through the page cache. Re-add
the necessary __dax_fault() call for such files.

Because the get_blocks callback on read faults may not set up the
mapping buffer correctly to allow unwritten extent completion to be
run, we need to allow callers of __dax_fault() to pass a null
complete_unwritten() callback. The DAX code always zeros the
unwritten page when it is read faulted so there are no stale data
exposure issues with not doing the conversion. The only downside
will be the potential for increased CPU overhead on repeated read
faults of the same page. If this proves to be a problem, then the
filesystem needs to fix it's get_block callback and provide a
convert_unwritten() callback to the read fault path.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c          |  9 ++++++++-
 fs/xfs/xfs_file.c | 21 +++++++++++++++------
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c3e21cc..86d2cee 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,6 +319,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
+ * @complete_unwritten: The filesystem method used to convert unwritten blocks
+ *	to written so the data written to them is exposed. This is required for
+ *	write faults, but optional for read faults as dax_insert_mapping() will
+ *	always do the right thing on a read fault (i.e. zero the underlying
+ *	page).
  *
  * When a page fault occurs, filesystems may call this helper in their
  * fault handler for DAX files. __dax_fault() assumes the caller has done all
@@ -339,6 +344,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int error;
 	int major = 0;
 
+	WARN_ON_ONCE((vmf->flags & FAULT_FLAG_WRITE) && !complete_unwritten);
+
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -437,7 +444,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 * as for normal BH based IO completions.
 	 */
 	error = dax_insert_mapping(inode, &bh, vma, vmf);
-	if (buffer_unwritten(&bh))
+	if (buffer_unwritten(&bh) && complete_unwritten)
 		complete_unwritten(&bh, !error);
 
  out:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f0e8249..db4acc1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1514,18 +1514,27 @@ xfs_filemap_fault(
 	struct vm_area_struct	*vma,
 	struct vm_fault		*vmf)
 {
-	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
+	struct inode		*inode = file_inode(vma->vm_file);
 	int			ret;
 
-	trace_xfs_filemap_fault(ip);
+	trace_xfs_filemap_fault(XFS_I(inode));
 
 	/* DAX can shortcut the normal fault path on write faults! */
-	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
+	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(inode))
 		return xfs_filemap_page_mkwrite(vma, vmf);
 
-	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
-	ret = filemap_fault(vma, vmf);
-	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	if (IS_DAX(inode)) {
+		/*
+		 * we do not want to trigger unwritten extent conversion on read
+		 * faults - that is unnecessary overhead and would also require
+		 * changes to xfs_get_blocks_direct() to map unwritten extent
+		 * ioend for conversion on read-only mappings.
+		 */
+		ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
+	} else
+		ret = filemap_fault(vma, vmf);
+	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	return ret;
 }
-- 
2.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/4] xfs: remote attribute headers contain an invalid LSN
  2015-07-21  1:09 [PATCH 0/4] xfs: various fixes Dave Chinner
  2015-07-21  1:09 ` [PATCH 1/4] xfs: call dax_fault on read page faults for DAX Dave Chinner
@ 2015-07-21  1:09 ` Dave Chinner
  2015-07-21 11:50   ` Brian Foster
  2015-07-21  1:09 ` [PATCH 3/4] xfs: remote attributes need to be considered data Dave Chinner
  2015-07-21  1:09 ` [PATCH 4/4] xfs: xfs_bunmapi() does not need XFS_BMAPI_METADATA flag Dave Chinner
  3 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2015-07-21  1:09 UTC (permalink / raw
  To: xfs; +Cc: willy

From: Dave Chinner <dchinner@redhat.com>

In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:

XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:

000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....

This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.

The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it.  It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.

The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.

Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.

Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.

As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_attr_remote.c | 29 +++++++++++++++++++++++------
 fs/xfs/xfs_log_recover.c        | 11 ++++++++---
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 20de88d..2faec26 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -159,11 +159,10 @@ xfs_attr3_rmt_write_verify(
 	struct xfs_buf	*bp)
 {
 	struct xfs_mount *mp = bp->b_target->bt_mount;
-	struct xfs_buf_log_item	*bip = bp->b_fspriv;
+	int		blksize = mp->m_attr_geo->blksize;
 	char		*ptr;
 	int		len;
 	xfs_daddr_t	bno;
-	int		blksize = mp->m_attr_geo->blksize;
 
 	/* no verification of non-crc buffers */
 	if (!xfs_sb_version_hascrc(&mp->m_sb))
@@ -175,16 +174,22 @@ xfs_attr3_rmt_write_verify(
 	ASSERT(len >= blksize);
 
 	while (len > 0) {
+		struct xfs_attr3_rmt_hdr *rmt = (struct xfs_attr3_rmt_hdr *)ptr;
+
 		if (!xfs_attr3_rmt_verify(mp, ptr, blksize, bno)) {
 			xfs_buf_ioerror(bp, -EFSCORRUPTED);
 			xfs_verifier_error(bp);
 			return;
 		}
-		if (bip) {
-			struct xfs_attr3_rmt_hdr *rmt;
 
-			rmt = (struct xfs_attr3_rmt_hdr *)ptr;
-			rmt->rm_lsn = cpu_to_be64(bip->bli_item.li_lsn);
+		/*
+		 * Ensure we aren't writing bogus LSNs to disk. See
+		 * xfs_attr3_rmt_hdr_set() for the explanation.
+		 */
+		if (rmt->rm_lsn != cpu_to_be64(NULLCOMMITLSN)) {
+			xfs_buf_ioerror(bp, -EFSCORRUPTED);
+			xfs_verifier_error(bp);
+			return;
 		}
 		xfs_update_cksum(ptr, blksize, XFS_ATTR3_RMT_CRC_OFF);
 
@@ -221,6 +226,18 @@ xfs_attr3_rmt_hdr_set(
 	rmt->rm_owner = cpu_to_be64(ino);
 	rmt->rm_blkno = cpu_to_be64(bno);
 
+	/*
+	 * Remote attribute blocks are written synchronously, so we don't
+	 * have an LSN that we can stamp in them that makes any sense to log
+	 * recovery. To ensure that log recovery handles overwrites of these
+	 * blocks sanely (i.e. once they've been freed and reallocated as some
+	 * other type of metadata) we need to ensure that the LSN has a value
+	 * that tells log recovery to ignore the LSN and overwrite the buffer
+	 * with whatever is in it's log. To do this, we use the magic
+	 * NULLCOMMITLSN to indicate that the LSN is invalid.
+	 */
+	rmt->rm_lsn = cpu_to_be64(NULLCOMMITLSN);
+
 	return sizeof(struct xfs_attr3_rmt_hdr);
 }
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 01dd228..480ebba 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1886,9 +1886,14 @@ xlog_recover_get_buf_lsn(
 		uuid = &((struct xfs_dir3_blk_hdr *)blk)->uuid;
 		break;
 	case XFS_ATTR3_RMT_MAGIC:
-		lsn = be64_to_cpu(((struct xfs_attr3_rmt_hdr *)blk)->rm_lsn);
-		uuid = &((struct xfs_attr3_rmt_hdr *)blk)->rm_uuid;
-		break;
+		/*
+		 * Remote attr blocks are written synchronously, rather than
+		 * being logged. That means they do not contain a valid LSN
+		 * (i.e. transactionally ordered) in them, and hence any time we
+		 * see a buffer to replay over the top of a remote attribute
+		 * block we should simply do so.
+		 */
+		goto recover_immediately;
 	case XFS_SB_MAGIC:
 		lsn = be64_to_cpu(((struct xfs_dsb *)blk)->sb_lsn);
 		uuid = &((struct xfs_dsb *)blk)->sb_uuid;
-- 
2.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/4] xfs: remote attributes need to be considered data
  2015-07-21  1:09 [PATCH 0/4] xfs: various fixes Dave Chinner
  2015-07-21  1:09 ` [PATCH 1/4] xfs: call dax_fault on read page faults for DAX Dave Chinner
  2015-07-21  1:09 ` [PATCH 2/4] xfs: remote attribute headers contain an invalid LSN Dave Chinner
@ 2015-07-21  1:09 ` Dave Chinner
  2015-07-21 11:50   ` Brian Foster
  2015-07-21  1:09 ` [PATCH 4/4] xfs: xfs_bunmapi() does not need XFS_BMAPI_METADATA flag Dave Chinner
  3 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2015-07-21  1:09 UTC (permalink / raw
  To: xfs; +Cc: willy

From: Dave Chinner <dchinner@redhat.com>

We don't log remote attribute contents, and instead write them
synchronously before we commit the block allocation and attribute
tree update transaction. As a result we are writing to the allocated
space before the allcoation has been made permanent.

As a result, we cannot consider this allocation to be a metadata
allocation. Metadata allocation can take blocks from the free list
and so reuse them before the transaction that freed the block is
committed to disk. This behaviour is perfectly fine for journalled
metadata changes as log recovery will ensure the free operation is
replayed before the overwrite, but for remote attribute writes this
is not the case.

Hence we have to consider the remote attribute blocks to contain
data and allocate accordingly. We do this by dropping the
XFS_BMAPI_METADATA flag from the block allocation. This means the
allocation will not use blocks that are on the busy list without
first ensuring that the freeing transaction has been committed to
disk and the blocks removed from the busy list. This ensures we will
never overwrite a freed block without first ensuring that it is
really free.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_attr_remote.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 2faec26..dd71403 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -451,14 +451,21 @@ xfs_attr_rmtval_set(
 
 		/*
 		 * Allocate a single extent, up to the size of the value.
+		 *
+		 * Note that we have to consider this a data allocation as we
+		 * write the remote attribute without logging the contents.
+		 * Hence we must ensure that we aren't using blocks that are on
+		 * the busy list so that we don't overwrite blocks which have
+		 * recently been freed but their transactions are not yet
+		 * committed to disk. If we overwrite the contents of a busy
+		 * extent and then crash then the block may not contain the
+		 * correct metadata after log recovery occurs.
 		 */
 		xfs_bmap_init(args->flist, args->firstblock);
 		nmap = 1;
 		error = xfs_bmapi_write(args->trans, dp, (xfs_fileoff_t)lblkno,
-				  blkcnt,
-				  XFS_BMAPI_ATTRFORK | XFS_BMAPI_METADATA,
-				  args->firstblock, args->total, &map, &nmap,
-				  args->flist);
+				  blkcnt, XFS_BMAPI_ATTRFORK, args->firstblock,
+				  args->total, &map, &nmap, args->flist);
 		if (!error) {
 			error = xfs_bmap_finish(&args->trans, args->flist,
 						&committed);
-- 
2.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/4] xfs: xfs_bunmapi() does not need XFS_BMAPI_METADATA flag
  2015-07-21  1:09 [PATCH 0/4] xfs: various fixes Dave Chinner
                   ` (2 preceding siblings ...)
  2015-07-21  1:09 ` [PATCH 3/4] xfs: remote attributes need to be considered data Dave Chinner
@ 2015-07-21  1:09 ` Dave Chinner
  2015-07-21 11:50   ` Brian Foster
  3 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2015-07-21  1:09 UTC (permalink / raw
  To: xfs; +Cc: willy

From: Dave Chinner <dchinner@redhat.com>

xfs_bunmapi() doesn't care what type of extent is being freed and
does not look at the XFS_BMAPI_METADATA flag at all. As such we can
remove the XFS_BMAPI_METADATA from all callers that use it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_attr_remote.c |  5 ++---
 fs/xfs/libxfs/xfs_da_btree.c    |  4 ++--
 fs/xfs/libxfs/xfs_dir2.c        | 33 +++++++++++++++------------------
 fs/xfs/xfs_symlink.c            |  2 +-
 4 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index dd71403..89356aa 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -618,9 +618,8 @@ xfs_attr_rmtval_remove(
 
 		xfs_bmap_init(args->flist, args->firstblock);
 		error = xfs_bunmapi(args->trans, args->dp, lblkno, blkcnt,
-				    XFS_BMAPI_ATTRFORK | XFS_BMAPI_METADATA,
-				    1, args->firstblock, args->flist,
-				    &done);
+				    XFS_BMAPI_ATTRFORK, 1, args->firstblock,
+				    args->flist, &done);
 		if (!error) {
 			error = xfs_bmap_finish(&args->trans, args->flist,
 						&committed);
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 2385f8c..2ae91e8 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2351,8 +2351,8 @@ xfs_da_shrink_inode(
 		 * the last block to the place we want to kill.
 		 */
 		error = xfs_bunmapi(tp, dp, dead_blkno, count,
-				    xfs_bmapi_aflag(w)|XFS_BMAPI_METADATA,
-				    0, args->firstblock, args->flist, &done);
+				    xfs_bmapi_aflag(w), 0, args->firstblock,
+				    args->flist, &done);
 		if (error == -ENOSPC) {
 			if (w != XFS_DATA_FORK)
 				break;
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index a69fb3a..e0ba976 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -674,25 +674,22 @@ xfs_dir2_shrink_inode(
 	mp = dp->i_mount;
 	tp = args->trans;
 	da = xfs_dir2_db_to_da(args->geo, db);
-	/*
-	 * Unmap the fsblock(s).
-	 */
-	if ((error = xfs_bunmapi(tp, dp, da, args->geo->fsbcount,
-			XFS_BMAPI_METADATA, 0, args->firstblock, args->flist,
-			&done))) {
+
+	/* Unmap the fsblock(s). */
+	error = xfs_bunmapi(tp, dp, da, args->geo->fsbcount, 0, 0,
+			    args->firstblock, args->flist, &done);
+	if (error) {
 		/*
-		 * ENOSPC actually can happen if we're in a removename with
-		 * no space reservation, and the resulting block removal
-		 * would cause a bmap btree split or conversion from extents
-		 * to btree.  This can only happen for un-fragmented
-		 * directory blocks, since you need to be punching out
-		 * the middle of an extent.
-		 * In this case we need to leave the block in the file,
-		 * and not binval it.
-		 * So the block has to be in a consistent empty state
-		 * and appropriately logged.
-		 * We don't free up the buffer, the caller can tell it
-		 * hasn't happened since it got an error back.
+		 * ENOSPC actually can happen if we're in a removename with no
+		 * space reservation, and the resulting block removal would
+		 * cause a bmap btree split or conversion from extents to btree.
+		 * This can only happen for un-fragmented directory blocks,
+		 * since you need to be punching out the middle of an extent.
+		 * In this case we need to leave the block in the file, and not
+		 * binval it.  So the block has to be in a consistent empty
+		 * state and appropriately logged.  We don't free up the buffer,
+		 * the caller can tell it hasn't happened since it got an error
+		 * back.
 		 */
 		return error;
 	}
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 4be27b0..05c44bf 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -501,7 +501,7 @@ xfs_inactive_symlink_rmt(
 	/*
 	 * Unmap the dead block(s) to the free_list.
 	 */
-	error = xfs_bunmapi(tp, ip, 0, size, XFS_BMAPI_METADATA, nmaps,
+	error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps,
 			    &first_block, &free_list, &done);
 	if (error)
 		goto error_bmap_cancel;
-- 
2.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] xfs: call dax_fault on read page faults for DAX
  2015-07-21  1:09 ` [PATCH 1/4] xfs: call dax_fault on read page faults for DAX Dave Chinner
@ 2015-07-21 11:50   ` Brian Foster
  2015-07-21 13:57   ` Matthew Wilcox
  1 sibling, 0 replies; 13+ messages in thread
From: Brian Foster @ 2015-07-21 11:50 UTC (permalink / raw
  To: Dave Chinner; +Cc: willy, xfs

On Tue, Jul 21, 2015 at 11:09:02AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When modifying the patch series to handle the XFS MMAP_LOCK nesting
> of page faults, I botched the conversion of the read page fault
> path, and so it is only every calling through the page cache. Re-add
> the necessary __dax_fault() call for such files.
> 
> Because the get_blocks callback on read faults may not set up the
> mapping buffer correctly to allow unwritten extent completion to be
> run, we need to allow callers of __dax_fault() to pass a null
> complete_unwritten() callback. The DAX code always zeros the
> unwritten page when it is read faulted so there are no stale data
> exposure issues with not doing the conversion. The only downside
> will be the potential for increased CPU overhead on repeated read
> faults of the same page. If this proves to be a problem, then the
> filesystem needs to fix it's get_block callback and provide a
> convert_unwritten() callback to the read fault path.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

For the XFS bits:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/dax.c          |  9 ++++++++-
>  fs/xfs/xfs_file.c | 21 +++++++++++++++------
>  2 files changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index c3e21cc..86d2cee 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -319,6 +319,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
>   * @get_block: The filesystem method used to translate file offsets to blocks
> + * @complete_unwritten: The filesystem method used to convert unwritten blocks
> + *	to written so the data written to them is exposed. This is required for
> + *	write faults, but optional for read faults as dax_insert_mapping() will
> + *	always do the right thing on a read fault (i.e. zero the underlying
> + *	page).
>   *
>   * When a page fault occurs, filesystems may call this helper in their
>   * fault handler for DAX files. __dax_fault() assumes the caller has done all
> @@ -339,6 +344,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	int error;
>  	int major = 0;
>  
> +	WARN_ON_ONCE((vmf->flags & FAULT_FLAG_WRITE) && !complete_unwritten);
> +
>  	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  	if (vmf->pgoff >= size)
>  		return VM_FAULT_SIGBUS;
> @@ -437,7 +444,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	 * as for normal BH based IO completions.
>  	 */
>  	error = dax_insert_mapping(inode, &bh, vma, vmf);
> -	if (buffer_unwritten(&bh))
> +	if (buffer_unwritten(&bh) && complete_unwritten)
>  		complete_unwritten(&bh, !error);
>  
>   out:
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index f0e8249..db4acc1 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1514,18 +1514,27 @@ xfs_filemap_fault(
>  	struct vm_area_struct	*vma,
>  	struct vm_fault		*vmf)
>  {
> -	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
> +	struct inode		*inode = file_inode(vma->vm_file);
>  	int			ret;
>  
> -	trace_xfs_filemap_fault(ip);
> +	trace_xfs_filemap_fault(XFS_I(inode));
>  
>  	/* DAX can shortcut the normal fault path on write faults! */
> -	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(inode))
>  		return xfs_filemap_page_mkwrite(vma, vmf);
>  
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	ret = filemap_fault(vma, vmf);
> -	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	if (IS_DAX(inode)) {
> +		/*
> +		 * we do not want to trigger unwritten extent conversion on read
> +		 * faults - that is unnecessary overhead and would also require
> +		 * changes to xfs_get_blocks_direct() to map unwritten extent
> +		 * ioend for conversion on read-only mappings.
> +		 */
> +		ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
> +	} else
> +		ret = filemap_fault(vma, vmf);
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	return ret;
>  }
> -- 
> 2.1.4
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/4] xfs: remote attribute headers contain an invalid LSN
  2015-07-21  1:09 ` [PATCH 2/4] xfs: remote attribute headers contain an invalid LSN Dave Chinner
@ 2015-07-21 11:50   ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2015-07-21 11:50 UTC (permalink / raw
  To: Dave Chinner; +Cc: willy, xfs

On Tue, Jul 21, 2015 at 11:09:03AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In recent testing, a system that crashed failed log recovery on
> restart with a bad symlink buffer magic number:
> 
> XFS (vda): Starting recovery (logdev: internal)
> XFS (vda): Bad symlink block magic!
> XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060
> 
> On examination of the log via xfs_logprint, none of the symlink
> buffers in the log had a bad magic number, nor were any other types
> of buffer log format headers mis-identified as symlink buffers.
> Tracing was used to find the buffer the kernel was tripping over,
> and xfs_db identified it's contents as:
> 
> 000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
> 020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
> 040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
> 060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
> .....
> 
> This is a remote attribute buffer, which are notable in that they
> are not logged but are instead written synchronously by the remote
> attribute code so that they exist on disk before the attribute
> transactions are committed to the journal.
> 
> The above remote attribute block has an invalid LSN in it - cycle
> 0xd002000, block 0 - which means when log recovery comes along to
> determine if the transaction that writes to the underlying block
> should be replayed, it sees a block that has a future LSN and so
> does not replay the buffer data in the transaction. Instead, it
> validates the buffer magic number and attaches the buffer verifier
> to it.  It is this buffer magic number check that is failing in the
> above assert, indicating that we skipped replay due to the LSN of
> the underlying buffer.
> 
> The problem here is that the remote attribute buffers cannot have a
> valid LSN placed into them, because the transaction that contains
> the attribute tree pointer changes and the block allocation that the
> attribute data is being written to hasn't yet been committed. Hence
> the LSN field in the attribute block is completely unwritten,
> thereby leaving the underlying contents of the block in the LSN
> field. It could have any value, and hence a future overwrite of the
> block by log recovery may or may not work correctly.
> 
> Fix this by always writing an invalid LSN to the remote attribute
> block, as any buffer in log recovery that needs to write over the
> remote attribute should occur. We are protected from having old data
> written over the attribute by the fact that freeing the block before
> the remote attribute is written will result in the buffer being
> marked stale in the log and so all changes prior to the buffer stale
> transaction will be cancelled by log recovery.
> 
> Hence it is safe to ignore the LSN in the case or synchronously
> written, unlogged metadata such as remote attribute blocks, and to
> ensure we do that correctly, we need to write an invalid LSN to all
> remote attribute blocks to trigger immediate recovery of metadata
> that is written over the top.
> 
> As a further protection for filesystems that may already have remote
> attribute blocks with bad LSNs on disk, change the log recovery code
> to always trigger immediate recovery of metadata over remote
> attribute blocks.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_attr_remote.c | 29 +++++++++++++++++++++++------
>  fs/xfs/xfs_log_recover.c        | 11 ++++++++---
>  2 files changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
> index 20de88d..2faec26 100644
> --- a/fs/xfs/libxfs/xfs_attr_remote.c
> +++ b/fs/xfs/libxfs/xfs_attr_remote.c
> @@ -159,11 +159,10 @@ xfs_attr3_rmt_write_verify(
>  	struct xfs_buf	*bp)
>  {
>  	struct xfs_mount *mp = bp->b_target->bt_mount;
> -	struct xfs_buf_log_item	*bip = bp->b_fspriv;
> +	int		blksize = mp->m_attr_geo->blksize;
>  	char		*ptr;
>  	int		len;
>  	xfs_daddr_t	bno;
> -	int		blksize = mp->m_attr_geo->blksize;
>  
>  	/* no verification of non-crc buffers */
>  	if (!xfs_sb_version_hascrc(&mp->m_sb))
> @@ -175,16 +174,22 @@ xfs_attr3_rmt_write_verify(
>  	ASSERT(len >= blksize);
>  
>  	while (len > 0) {
> +		struct xfs_attr3_rmt_hdr *rmt = (struct xfs_attr3_rmt_hdr *)ptr;
> +
>  		if (!xfs_attr3_rmt_verify(mp, ptr, blksize, bno)) {
>  			xfs_buf_ioerror(bp, -EFSCORRUPTED);
>  			xfs_verifier_error(bp);
>  			return;
>  		}
> -		if (bip) {
> -			struct xfs_attr3_rmt_hdr *rmt;
>  
> -			rmt = (struct xfs_attr3_rmt_hdr *)ptr;
> -			rmt->rm_lsn = cpu_to_be64(bip->bli_item.li_lsn);
> +		/*
> +		 * Ensure we aren't writing bogus LSNs to disk. See
> +		 * xfs_attr3_rmt_hdr_set() for the explanation.
> +		 */
> +		if (rmt->rm_lsn != cpu_to_be64(NULLCOMMITLSN)) {
> +			xfs_buf_ioerror(bp, -EFSCORRUPTED);
> +			xfs_verifier_error(bp);
> +			return;
>  		}
>  		xfs_update_cksum(ptr, blksize, XFS_ATTR3_RMT_CRC_OFF);
>  
> @@ -221,6 +226,18 @@ xfs_attr3_rmt_hdr_set(
>  	rmt->rm_owner = cpu_to_be64(ino);
>  	rmt->rm_blkno = cpu_to_be64(bno);
>  
> +	/*
> +	 * Remote attribute blocks are written synchronously, so we don't
> +	 * have an LSN that we can stamp in them that makes any sense to log
> +	 * recovery. To ensure that log recovery handles overwrites of these
> +	 * blocks sanely (i.e. once they've been freed and reallocated as some
> +	 * other type of metadata) we need to ensure that the LSN has a value
> +	 * that tells log recovery to ignore the LSN and overwrite the buffer
> +	 * with whatever is in it's log. To do this, we use the magic
> +	 * NULLCOMMITLSN to indicate that the LSN is invalid.
> +	 */
> +	rmt->rm_lsn = cpu_to_be64(NULLCOMMITLSN);
> +
>  	return sizeof(struct xfs_attr3_rmt_hdr);
>  }
>  
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 01dd228..480ebba 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -1886,9 +1886,14 @@ xlog_recover_get_buf_lsn(
>  		uuid = &((struct xfs_dir3_blk_hdr *)blk)->uuid;
>  		break;
>  	case XFS_ATTR3_RMT_MAGIC:
> -		lsn = be64_to_cpu(((struct xfs_attr3_rmt_hdr *)blk)->rm_lsn);
> -		uuid = &((struct xfs_attr3_rmt_hdr *)blk)->rm_uuid;
> -		break;
> +		/*
> +		 * Remote attr blocks are written synchronously, rather than
> +		 * being logged. That means they do not contain a valid LSN
> +		 * (i.e. transactionally ordered) in them, and hence any time we
> +		 * see a buffer to replay over the top of a remote attribute
> +		 * block we should simply do so.
> +		 */
> +		goto recover_immediately;
>  	case XFS_SB_MAGIC:
>  		lsn = be64_to_cpu(((struct xfs_dsb *)blk)->sb_lsn);
>  		uuid = &((struct xfs_dsb *)blk)->sb_uuid;
> -- 
> 2.1.4
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] xfs: remote attributes need to be considered data
  2015-07-21  1:09 ` [PATCH 3/4] xfs: remote attributes need to be considered data Dave Chinner
@ 2015-07-21 11:50   ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2015-07-21 11:50 UTC (permalink / raw
  To: Dave Chinner; +Cc: willy, xfs

On Tue, Jul 21, 2015 at 11:09:04AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We don't log remote attribute contents, and instead write them
> synchronously before we commit the block allocation and attribute
> tree update transaction. As a result we are writing to the allocated
> space before the allcoation has been made permanent.
> 
> As a result, we cannot consider this allocation to be a metadata
> allocation. Metadata allocation can take blocks from the free list

							   busy list ?

> and so reuse them before the transaction that freed the block is
> committed to disk. This behaviour is perfectly fine for journalled
> metadata changes as log recovery will ensure the free operation is
> replayed before the overwrite, but for remote attribute writes this
> is not the case.
> 
> Hence we have to consider the remote attribute blocks to contain
> data and allocate accordingly. We do this by dropping the
> XFS_BMAPI_METADATA flag from the block allocation. This means the
> allocation will not use blocks that are on the busy list without
> first ensuring that the freeing transaction has been committed to
> disk and the blocks removed from the busy list. This ensures we will
> never overwrite a freed block without first ensuring that it is
> really free.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_attr_remote.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
> index 2faec26..dd71403 100644
> --- a/fs/xfs/libxfs/xfs_attr_remote.c
> +++ b/fs/xfs/libxfs/xfs_attr_remote.c
> @@ -451,14 +451,21 @@ xfs_attr_rmtval_set(
>  
>  		/*
>  		 * Allocate a single extent, up to the size of the value.
> +		 *
> +		 * Note that we have to consider this a data allocation as we
> +		 * write the remote attribute without logging the contents.
> +		 * Hence we must ensure that we aren't using blocks that are on
> +		 * the busy list so that we don't overwrite blocks which have
> +		 * recently been freed but their transactions are not yet
> +		 * committed to disk. If we overwrite the contents of a busy
> +		 * extent and then crash then the block may not contain the
> +		 * correct metadata after log recovery occurs.
>  		 */
>  		xfs_bmap_init(args->flist, args->firstblock);
>  		nmap = 1;
>  		error = xfs_bmapi_write(args->trans, dp, (xfs_fileoff_t)lblkno,
> -				  blkcnt,
> -				  XFS_BMAPI_ATTRFORK | XFS_BMAPI_METADATA,
> -				  args->firstblock, args->total, &map, &nmap,
> -				  args->flist);
> +				  blkcnt, XFS_BMAPI_ATTRFORK, args->firstblock,
> +				  args->total, &map, &nmap, args->flist);
>  		if (!error) {
>  			error = xfs_bmap_finish(&args->trans, args->flist,
>  						&committed);
> -- 
> 2.1.4
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] xfs: xfs_bunmapi() does not need XFS_BMAPI_METADATA flag
  2015-07-21  1:09 ` [PATCH 4/4] xfs: xfs_bunmapi() does not need XFS_BMAPI_METADATA flag Dave Chinner
@ 2015-07-21 11:50   ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2015-07-21 11:50 UTC (permalink / raw
  To: Dave Chinner; +Cc: willy, xfs

On Tue, Jul 21, 2015 at 11:09:05AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_bunmapi() doesn't care what type of extent is being freed and
> does not look at the XFS_BMAPI_METADATA flag at all. As such we can
> remove the XFS_BMAPI_METADATA from all callers that use it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_attr_remote.c |  5 ++---
>  fs/xfs/libxfs/xfs_da_btree.c    |  4 ++--
>  fs/xfs/libxfs/xfs_dir2.c        | 33 +++++++++++++++------------------
>  fs/xfs/xfs_symlink.c            |  2 +-
>  4 files changed, 20 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
> index dd71403..89356aa 100644
> --- a/fs/xfs/libxfs/xfs_attr_remote.c
> +++ b/fs/xfs/libxfs/xfs_attr_remote.c
> @@ -618,9 +618,8 @@ xfs_attr_rmtval_remove(
>  
>  		xfs_bmap_init(args->flist, args->firstblock);
>  		error = xfs_bunmapi(args->trans, args->dp, lblkno, blkcnt,
> -				    XFS_BMAPI_ATTRFORK | XFS_BMAPI_METADATA,
> -				    1, args->firstblock, args->flist,
> -				    &done);
> +				    XFS_BMAPI_ATTRFORK, 1, args->firstblock,
> +				    args->flist, &done);
>  		if (!error) {
>  			error = xfs_bmap_finish(&args->trans, args->flist,
>  						&committed);
> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index 2385f8c..2ae91e8 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2351,8 +2351,8 @@ xfs_da_shrink_inode(
>  		 * the last block to the place we want to kill.
>  		 */
>  		error = xfs_bunmapi(tp, dp, dead_blkno, count,
> -				    xfs_bmapi_aflag(w)|XFS_BMAPI_METADATA,
> -				    0, args->firstblock, args->flist, &done);
> +				    xfs_bmapi_aflag(w), 0, args->firstblock,
> +				    args->flist, &done);
>  		if (error == -ENOSPC) {
>  			if (w != XFS_DATA_FORK)
>  				break;
> diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
> index a69fb3a..e0ba976 100644
> --- a/fs/xfs/libxfs/xfs_dir2.c
> +++ b/fs/xfs/libxfs/xfs_dir2.c
> @@ -674,25 +674,22 @@ xfs_dir2_shrink_inode(
>  	mp = dp->i_mount;
>  	tp = args->trans;
>  	da = xfs_dir2_db_to_da(args->geo, db);
> -	/*
> -	 * Unmap the fsblock(s).
> -	 */
> -	if ((error = xfs_bunmapi(tp, dp, da, args->geo->fsbcount,
> -			XFS_BMAPI_METADATA, 0, args->firstblock, args->flist,
> -			&done))) {
> +
> +	/* Unmap the fsblock(s). */
> +	error = xfs_bunmapi(tp, dp, da, args->geo->fsbcount, 0, 0,
> +			    args->firstblock, args->flist, &done);
> +	if (error) {
>  		/*
> -		 * ENOSPC actually can happen if we're in a removename with
> -		 * no space reservation, and the resulting block removal
> -		 * would cause a bmap btree split or conversion from extents
> -		 * to btree.  This can only happen for un-fragmented
> -		 * directory blocks, since you need to be punching out
> -		 * the middle of an extent.
> -		 * In this case we need to leave the block in the file,
> -		 * and not binval it.
> -		 * So the block has to be in a consistent empty state
> -		 * and appropriately logged.
> -		 * We don't free up the buffer, the caller can tell it
> -		 * hasn't happened since it got an error back.
> +		 * ENOSPC actually can happen if we're in a removename with no
> +		 * space reservation, and the resulting block removal would
> +		 * cause a bmap btree split or conversion from extents to btree.
> +		 * This can only happen for un-fragmented directory blocks,
> +		 * since you need to be punching out the middle of an extent.
> +		 * In this case we need to leave the block in the file, and not
> +		 * binval it.  So the block has to be in a consistent empty
> +		 * state and appropriately logged.  We don't free up the buffer,
> +		 * the caller can tell it hasn't happened since it got an error
> +		 * back.
>  		 */
>  		return error;
>  	}
> diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> index 4be27b0..05c44bf 100644
> --- a/fs/xfs/xfs_symlink.c
> +++ b/fs/xfs/xfs_symlink.c
> @@ -501,7 +501,7 @@ xfs_inactive_symlink_rmt(
>  	/*
>  	 * Unmap the dead block(s) to the free_list.
>  	 */
> -	error = xfs_bunmapi(tp, ip, 0, size, XFS_BMAPI_METADATA, nmaps,
> +	error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps,
>  			    &first_block, &free_list, &done);
>  	if (error)
>  		goto error_bmap_cancel;
> -- 
> 2.1.4
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] xfs: call dax_fault on read page faults for DAX
  2015-07-21  1:09 ` [PATCH 1/4] xfs: call dax_fault on read page faults for DAX Dave Chinner
  2015-07-21 11:50   ` Brian Foster
@ 2015-07-21 13:57   ` Matthew Wilcox
  2015-07-24  1:31     ` [PATCH 1/4 v2] " Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2015-07-21 13:57 UTC (permalink / raw
  To: Dave Chinner; +Cc: xfs

On Tue, Jul 21, 2015 at 11:09:02AM +1000, Dave Chinner wrote:
> @@ -339,6 +344,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	int error;
>  	int major = 0;
>  
> +	WARN_ON_ONCE((vmf->flags & FAULT_FLAG_WRITE) && !complete_unwritten);
> +
>  	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  	if (vmf->pgoff >= size)
>  		return VM_FAULT_SIGBUS;

This warning is always going to trigger for ext2, since it doesn't
support the concept of unwritten extents.  Instead, ext2 zeroes the block
before linking it into the tree and returning from get_block.

> @@ -437,7 +444,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	 * as for normal BH based IO completions.
>  	 */
>  	error = dax_insert_mapping(inode, &bh, vma, vmf);
> -	if (buffer_unwritten(&bh))
> +	if (buffer_unwritten(&bh) && complete_unwritten)
>  		complete_unwritten(&bh, !error);
>  
>   out:

... so maybe we should do something here like:

	if (buffer_unwritten(&bh)) {
		if (complete_unwritten)
			complete_unwritten(&bh, !error);
		else
			BUG_ON(vmf->flags & FAULT_FLAG_WRITE);
	}

(the XFS changes look fine to me)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4 v2] xfs: call dax_fault on read page faults for DAX
  2015-07-21 13:57   ` Matthew Wilcox
@ 2015-07-24  1:31     ` Dave Chinner
  2015-07-24 13:44       ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2015-07-24  1:31 UTC (permalink / raw
  To: Matthew Wilcox; +Cc: xfs

On Tue, Jul 21, 2015 at 09:57:58AM -0400, Matthew Wilcox wrote:
> On Tue, Jul 21, 2015 at 11:09:02AM +1000, Dave Chinner wrote:
> > @@ -339,6 +344,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	int error;
> >  	int major = 0;
> >  
> > +	WARN_ON_ONCE((vmf->flags & FAULT_FLAG_WRITE) && !complete_unwritten);
> > +
> >  	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> >  	if (vmf->pgoff >= size)
> >  		return VM_FAULT_SIGBUS;
> 
> This warning is always going to trigger for ext2, since it doesn't
> support the concept of unwritten extents.  Instead, ext2 zeroes the block
> before linking it into the tree and returning from get_block.
> 
> > @@ -437,7 +444,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	 * as for normal BH based IO completions.
> >  	 */
> >  	error = dax_insert_mapping(inode, &bh, vma, vmf);
> > -	if (buffer_unwritten(&bh))
> > +	if (buffer_unwritten(&bh) && complete_unwritten)
> >  		complete_unwritten(&bh, !error);
> >  
> >   out:
> 
> ... so maybe we should do something here like:
> 
> 	if (buffer_unwritten(&bh)) {
> 		if (complete_unwritten)
> 			complete_unwritten(&bh, !error);
> 		else
> 			BUG_ON(vmf->flags & FAULT_FLAG_WRITE);
> 	}

I dislike BUG_ON() calls because they take production systems down
when it is not necessary. Failure to convert an unwritten extent is
not fatal - we need to warn about it so that the user understands
there's a bug that caused the data corruption they are seeing, but
we don't need crash the machine...

New patch below.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: call dax_fault on read page faults for DAX

From: Dave Chinner <dchinner@redhat.com>

When modifying the patch series to handle the XFS MMAP_LOCK nesting
of page faults, I botched the conversion of the read page fault
path, and so it is only every calling through the page cache. Re-add
the necessary __dax_fault() call for such files.

Because the get_blocks callback on read faults may not set up the
mapping buffer correctly to allow unwritten extent completion to be
run, we need to allow callers of __dax_fault() to pass a null
complete_unwritten() callback. The DAX code always zeros the
unwritten page when it is read faulted so there are no stale data
exposure issues with not doing the conversion. The only downside
will be the potential for increased CPU overhead on repeated read
faults of the same page. If this proves to be a problem, then the
filesystem needs to fix it's get_block callback and provide a
convert_unwritten() callback to the read fault path.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c          | 14 ++++++++++++--
 fs/xfs/xfs_file.c | 21 +++++++++++++++------
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c3e21cc..a7f77e1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,6 +319,12 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
+ * @complete_unwritten: The filesystem method used to convert unwritten blocks
+ *	to written so the data written to them is exposed. This is required for
+ *	required by write faults for filesystems that will return unwritten
+ *	extent mappings from @get_block, but it is optional for reads as
+ *	dax_insert_mapping() will always zero unwritten blocks. If the fs does
+ *	not support unwritten extents, the it should pass NULL.
  *
  * When a page fault occurs, filesystems may call this helper in their
  * fault handler for DAX files. __dax_fault() assumes the caller has done all
@@ -437,8 +443,12 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 * as for normal BH based IO completions.
 	 */
 	error = dax_insert_mapping(inode, &bh, vma, vmf);
-	if (buffer_unwritten(&bh))
-		complete_unwritten(&bh, !error);
+	if (buffer_unwritten(&bh)) {
+		if (complete_unwritten)
+			complete_unwritten(&bh, !error);
+		else
+			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
+	}
 
  out:
 	if (error == -ENOMEM)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f0e8249..db4acc1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1514,18 +1514,27 @@ xfs_filemap_fault(
 	struct vm_area_struct	*vma,
 	struct vm_fault		*vmf)
 {
-	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
+	struct inode		*inode = file_inode(vma->vm_file);
 	int			ret;
 
-	trace_xfs_filemap_fault(ip);
+	trace_xfs_filemap_fault(XFS_I(inode));
 
 	/* DAX can shortcut the normal fault path on write faults! */
-	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
+	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(inode))
 		return xfs_filemap_page_mkwrite(vma, vmf);
 
-	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
-	ret = filemap_fault(vma, vmf);
-	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	if (IS_DAX(inode)) {
+		/*
+		 * we do not want to trigger unwritten extent conversion on read
+		 * faults - that is unnecessary overhead and would also require
+		 * changes to xfs_get_blocks_direct() to map unwritten extent
+		 * ioend for conversion on read-only mappings.
+		 */
+		ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
+	} else
+		ret = filemap_fault(vma, vmf);
+	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	return ret;
 }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4 v2] xfs: call dax_fault on read page faults for DAX
  2015-07-24  1:31     ` [PATCH 1/4 v2] " Dave Chinner
@ 2015-07-24 13:44       ` Matthew Wilcox
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2015-07-24 13:44 UTC (permalink / raw
  To: Dave Chinner; +Cc: xfs

On Fri, Jul 24, 2015 at 11:31:21AM +1000, Dave Chinner wrote:
> New patch below.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: call dax_fault on read page faults for DAX
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When modifying the patch series to handle the XFS MMAP_LOCK nesting
> of page faults, I botched the conversion of the read page fault
> path, and so it is only every calling through the page cache. Re-add
> the necessary __dax_fault() call for such files.
> 
> Because the get_blocks callback on read faults may not set up the
> mapping buffer correctly to allow unwritten extent completion to be
> run, we need to allow callers of __dax_fault() to pass a null
> complete_unwritten() callback. The DAX code always zeros the
> unwritten page when it is read faulted so there are no stale data
> exposure issues with not doing the conversion. The only downside
> will be the potential for increased CPU overhead on repeated read
> faults of the same page. If this proves to be a problem, then the
> filesystem needs to fix it's get_block callback and provide a
> convert_unwritten() callback to the read fault path.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Matthew Wilcox <willy@linux.intel.com>

> ---
>  fs/dax.c          | 14 ++++++++++++--
>  fs/xfs/xfs_file.c | 21 +++++++++++++++------
>  2 files changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index c3e21cc..a7f77e1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -319,6 +319,12 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
>   * @get_block: The filesystem method used to translate file offsets to blocks
> + * @complete_unwritten: The filesystem method used to convert unwritten blocks
> + *	to written so the data written to them is exposed. This is required for
> + *	required by write faults for filesystems that will return unwritten
> + *	extent mappings from @get_block, but it is optional for reads as
> + *	dax_insert_mapping() will always zero unwritten blocks. If the fs does
> + *	not support unwritten extents, the it should pass NULL.
>   *
>   * When a page fault occurs, filesystems may call this helper in their
>   * fault handler for DAX files. __dax_fault() assumes the caller has done all
> @@ -437,8 +443,12 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	 * as for normal BH based IO completions.
>  	 */
>  	error = dax_insert_mapping(inode, &bh, vma, vmf);
> -	if (buffer_unwritten(&bh))
> -		complete_unwritten(&bh, !error);
> +	if (buffer_unwritten(&bh)) {
> +		if (complete_unwritten)
> +			complete_unwritten(&bh, !error);
> +		else
> +			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
> +	}
>  
>   out:
>  	if (error == -ENOMEM)
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index f0e8249..db4acc1 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1514,18 +1514,27 @@ xfs_filemap_fault(
>  	struct vm_area_struct	*vma,
>  	struct vm_fault		*vmf)
>  {
> -	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
> +	struct inode		*inode = file_inode(vma->vm_file);
>  	int			ret;
>  
> -	trace_xfs_filemap_fault(ip);
> +	trace_xfs_filemap_fault(XFS_I(inode));
>  
>  	/* DAX can shortcut the normal fault path on write faults! */
> -	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(inode))
>  		return xfs_filemap_page_mkwrite(vma, vmf);
>  
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	ret = filemap_fault(vma, vmf);
> -	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	if (IS_DAX(inode)) {
> +		/*
> +		 * we do not want to trigger unwritten extent conversion on read
> +		 * faults - that is unnecessary overhead and would also require
> +		 * changes to xfs_get_blocks_direct() to map unwritten extent
> +		 * ioend for conversion on read-only mappings.
> +		 */
> +		ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
> +	} else
> +		ret = filemap_fault(vma, vmf);
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	return ret;
>  }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 0/4] xfs: various fixes
@ 2016-04-05  6:05 Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-04-05  6:05 UTC (permalink / raw
  To: xfs

Hi folks,

The following series of patches are all standalone fixes for various
issues that have been reported by users. The first AGFL fix is a
recent regression, which the other three address long standing
issues.

The second patch addresses the issue of memory allocation occurring
under the CIL lock and memory reclaim requiring the CIL lock to be
taken, resulting in deadlocks in extreme low memory conditions. This
version has minor changes around the way the shadow buffer is
allocated and initialised, but is otherwise mostly unchanged from
previous RFC postings.

The third patch addresses the excessive CPU overhead of bitmap based
dirty buffer range tracking on large block sizes. It keeps a small
number of ranges per buffer, and extends and merges them as
appropriate. This completely removes the bitmap overhead from the
buffer item formatting path. This main change in this patch from
previous RFC postings is that it uses multiple ranges for tracking
rather than a single range. I decided on 4 discrete ranges as the
best tradeoff between efficiency and log space requirements.

THe last patch addresses a buffer hold-off when there is lots of
dirty metadata beign written back from the AIL. If a buffer at the
tail of the log is at a very high offset in the filesytem (e.g. AGF
header in the last AG), it could be locked for a long time while we
wait for IO to lower offsets to be dispatched. IO dispatch gets
blocked by device congestion, so in the worst case lock hold-offs
measured in seconds have been recorded. This change sorts the buffer
list into offset order before we lock any buffers, and then
dispatches IO as each buffer is successfully locked. This limits
lock holds to the length of time a buffer sits on the plug plus the
IO time, which is usually very short. This means frequently used
buffers can be locked and relogged while we wait for IO dispatch
rather than blocking attempts to access the buffer.

Comments, thoughts, etc all welcome.

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-04-05  6:06 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-07-21  1:09 [PATCH 0/4] xfs: various fixes Dave Chinner
2015-07-21  1:09 ` [PATCH 1/4] xfs: call dax_fault on read page faults for DAX Dave Chinner
2015-07-21 11:50   ` Brian Foster
2015-07-21 13:57   ` Matthew Wilcox
2015-07-24  1:31     ` [PATCH 1/4 v2] " Dave Chinner
2015-07-24 13:44       ` Matthew Wilcox
2015-07-21  1:09 ` [PATCH 2/4] xfs: remote attribute headers contain an invalid LSN Dave Chinner
2015-07-21 11:50   ` Brian Foster
2015-07-21  1:09 ` [PATCH 3/4] xfs: remote attributes need to be considered data Dave Chinner
2015-07-21 11:50   ` Brian Foster
2015-07-21  1:09 ` [PATCH 4/4] xfs: xfs_bunmapi() does not need XFS_BMAPI_METADATA flag Dave Chinner
2015-07-21 11:50   ` Brian Foster
  -- strict thread matches above, loose matches on Subject: below --
2016-04-05  6:05 [PATCH 0/4] xfs: various fixes Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.