All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
@ 2015-06-17  7:54 Liu Bo
  2015-06-17  7:54 ` [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file Liu Bo
  2015-06-17 15:33 ` [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION David Sterba
  0 siblings, 2 replies; 28+ messages in thread
From: Liu Bo @ 2015-06-17  7:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: fdmanana

MS_I_VERSION is enabled by default for btrfs, this adds an alternative
option to toggle it off.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/super.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 05fef19..e610e3e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -324,7 +324,7 @@ enum {
 	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
 	Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
 	Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
-	Opt_datasum, Opt_treelog, Opt_noinode_cache,
+	Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_noi_version,
 	Opt_err,
 };
 
@@ -351,6 +351,7 @@ static match_table_t tokens = {
 	{Opt_nossd, "nossd"},
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
+	{Opt_noi_version, "noi_version"},
 	{Opt_notreelog, "notreelog"},
 	{Opt_treelog, "treelog"},
 	{Opt_flushoncommit, "flushoncommit"},
@@ -593,6 +594,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_noacl:
 			root->fs_info->sb->s_flags &= ~MS_POSIXACL;
 			break;
+		case Opt_noi_version:
+			root->fs_info->sb->s_flags &= ~MS_I_VERSION;
+			btrfs_info(root->fs_info, "disable i_version");
+			break;
 		case Opt_notreelog:
 			btrfs_set_and_info(root, NOTREELOG,
 					   "disabling tree log");
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
  2015-06-17  7:54 [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION Liu Bo
@ 2015-06-17  7:54 ` Liu Bo
  2015-06-17 15:58   ` David Sterba
  2015-06-17 15:33 ` [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION David Sterba
  1 sibling, 1 reply; 28+ messages in thread
From: Liu Bo @ 2015-06-17  7:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: fdmanana

If  we're overwriting an allocated file without changing timestamp
and inode version, and the file is with NODATACOW, we don't have any metadata to
commit, thus we can just flush the data device cache and go forward.

However, if there's have any change on extents' disk bytenr, inode size,
timestamp or inode version, we need to go through the normal btrfs_log_inode
path.

Test:
----------------------------
1. sysbench test of
"1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite +
fsync_on_each_write",
2. loop device associated with tmpfs file
3.
  - For btrfs, "-o nodatacow" and "-o noi_version" option
  - For ext4 and xfs, no extra mount options
----------------------------

Results:
----------------------------
- btrfs:
w/o: ~30Mb/sec
w:   ~131Mb/sec

- other filesystems: (both don't enable i_version by default)
ext4:  203Mb/sec
xfs:   212Mb/sec
----------------------------

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
v2: Catch errors from data writeback and skip barrier if necessary.

 fs/btrfs/btrfs_inode.h |    2 +
 fs/btrfs/disk-io.c     |    2 +-
 fs/btrfs/disk-io.h     |    1 +
 fs/btrfs/file.c        |   54 +++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/inode.c       |    3 ++
 5 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index de5e4f2..f7b99b6 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,8 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST		9
 #define BTRFS_INODE_READDIO_NEED_LOCK		10
 #define BTRFS_INODE_HAS_PROPS		        11
+#define BTRFS_INODE_NOTIMESTAMP			12
+#define BTRFS_INODE_NOISIZE			13
 /*
  * The following 3 bits are meant only for the btree inode.
  * When any of them is set, it means an error happened while writing an
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 639f266..8a41df1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3299,7 +3299,7 @@ static int write_dev_flush(struct btrfs_device *device, int wait)
  * send an empty flush down to each device in parallel,
  * then wait for them
  */
-static int barrier_all_devices(struct btrfs_fs_info *info)
+int barrier_all_devices(struct btrfs_fs_info *info)
 {
 	struct list_head *head;
 	struct btrfs_device *dev;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 27d44c0..bea982c 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
 int write_ctree_super(struct btrfs_trans_handle *trans,
 		      struct btrfs_root *root, int max_mirrors);
 struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
+int barrier_all_devices(struct btrfs_fs_info *info);
 int btrfs_commit_super(struct btrfs_root *root);
 struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
 					    u64 bytenr);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index faa7d39..180a3e1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	 * the disk i_size.  There is no need to log the inode
 	 * at this time.
 	 */
-	if (end_pos > isize)
+	if (end_pos > isize) {
 		i_size_write(inode, end_pos);
+		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
+	} else {
+		set_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
+	}
 	return 0;
 }
 
@@ -1715,19 +1719,33 @@ out:
 static void update_time_for_write(struct inode *inode)
 {
 	struct timespec now;
+	int sync_it = 0;
 
-	if (IS_NOCMTIME(inode))
+	if (IS_NOCMTIME(inode)) {
+		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
 		return;
+	}
 
 	now = current_fs_time(inode->i_sb);
-	if (!timespec_equal(&inode->i_mtime, &now))
+	if (!timespec_equal(&inode->i_mtime, &now)) {
 		inode->i_mtime = now;
+		sync_it = S_MTIME;
+	}
 
-	if (!timespec_equal(&inode->i_ctime, &now))
+	if (!timespec_equal(&inode->i_ctime, &now)) {
 		inode->i_ctime = now;
+		sync_it |= S_CTIME;
+	}
 
-	if (IS_I_VERSION(inode))
+	if (IS_I_VERSION(inode)) {
 		inode_inc_iversion(inode);
+		sync_it |= S_VERSION;
+	}
+
+	if (!sync_it)
+		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
+	else
+		clear_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		goto out;
 	}
 
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
+		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
+					&BTRFS_I(inode)->runtime_flags) &&
+		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
+					&BTRFS_I(inode)->runtime_flags)) {
+
+			/* make sure data is on disk and catch error */
+			if (!full_sync)
+				ret = filemap_fdatawait_range(inode->i_mapping,
+							      start, end);
+
+			if (!ret && !btrfs_test_opt(root, NOBARRIER)) {
+				mutex_lock(&root->fs_info->
+					   fs_devices->device_list_mutex);
+				ret = barrier_all_devices(root->fs_info);
+				if (ret)
+					btrfs_error(root->fs_info, ret,
+						"errors while submitting device barriers.");	
+				mutex_unlock(&root->fs_info->
+					   fs_devices->device_list_mutex);
+			}
+			mutex_unlock(&inode->i_mutex);
+			goto out;
+		}
+	}
+
 	/*
 	 * ok we haven't committed the transaction yet, lets do a commit
 	 */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 43192e1..e2912c8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1383,6 +1383,7 @@ out_check:
 
 		btrfs_release_path(path);
 		if (cow_start != (u64)-1) {
+			clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
 			ret = cow_file_range(inode, locked_page,
 					     cow_start, found_key.offset - 1,
 					     page_started, nr_written, 1);
@@ -1425,6 +1426,7 @@ out_check:
 						em->start + em->len - 1, 0);
 			}
 			type = BTRFS_ORDERED_PREALLOC;
+			clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
 		} else {
 			type = BTRFS_ORDERED_NOCOW;
 		}
@@ -1463,6 +1465,7 @@ out_check:
 	}
 
 	if (cow_start != (u64)-1) {
+		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
 		ret = cow_file_range(inode, locked_page, cow_start, end,
 				     page_started, nr_written, 1);
 		if (ret)
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
  2015-06-17  7:54 [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION Liu Bo
  2015-06-17  7:54 ` [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file Liu Bo
@ 2015-06-17 15:33 ` David Sterba
  2015-06-17 15:52   ` Liu Bo
  1 sibling, 1 reply; 28+ messages in thread
From: David Sterba @ 2015-06-17 15:33 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, fdmanana

On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> option to toggle it off.

There's an existing generic iversion/noiversion mount option pair, no
need to extra add it to btrfs.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
  2015-06-17 15:33 ` [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION David Sterba
@ 2015-06-17 15:52   ` Liu Bo
  2015-06-17 17:01     ` David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Liu Bo @ 2015-06-17 15:52 UTC (permalink / raw)
  To: dsterba, linux-btrfs, fdmanana

On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > option to toggle it off.
> 
> There's an existing generic iversion/noiversion mount option pair, no
> need to extra add it to btrfs.

I know, it doesn't work though.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
  2015-06-17  7:54 ` [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file Liu Bo
@ 2015-06-17 15:58   ` David Sterba
  2015-06-18  3:27     ` Liu Bo
  0 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2015-06-17 15:58 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, fdmanana

On Wed, Jun 17, 2015 at 03:54:32PM +0800, Liu Bo wrote:
> If  we're overwriting an allocated file without changing timestamp
> and inode version, and the file is with NODATACOW, we don't have any metadata to
> commit, thus we can just flush the data device cache and go forward.
> 
> However, if there's have any change on extents' disk bytenr, inode size,
> timestamp or inode version, we need to go through the normal btrfs_log_inode
> path.
> 
> Test:
> ----------------------------
> 1. sysbench test of
> "1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite +
> fsync_on_each_write",
> 2. loop device associated with tmpfs file
> 3.
>   - For btrfs, "-o nodatacow" and "-o noi_version" option
>   - For ext4 and xfs, no extra mount options
> ----------------------------
> 
> Results:
> ----------------------------
> - btrfs:
> w/o: ~30Mb/sec
> w:   ~131Mb/sec
> 
> - other filesystems: (both don't enable i_version by default)
> ext4:  203Mb/sec
> xfs:   212Mb/sec
> ----------------------------
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
> v2: Catch errors from data writeback and skip barrier if necessary.
> 
>  fs/btrfs/btrfs_inode.h |    2 +
>  fs/btrfs/disk-io.c     |    2 +-
>  fs/btrfs/disk-io.h     |    1 +
>  fs/btrfs/file.c        |   54 +++++++++++++++++++++++++++++++++++++++++++----
>  fs/btrfs/inode.c       |    3 ++
>  5 files changed, 56 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index de5e4f2..f7b99b6 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -44,6 +44,8 @@
>  #define BTRFS_INODE_IN_DELALLOC_LIST		9
>  #define BTRFS_INODE_READDIO_NEED_LOCK		10
>  #define BTRFS_INODE_HAS_PROPS		        11
> +#define BTRFS_INODE_NOTIMESTAMP			12
> +#define BTRFS_INODE_NOISIZE			13

It's not clear what the flags mean and that they're related to syncing
under some conditions.

>  /*
>   * The following 3 bits are meant only for the btree inode.
>   * When any of them is set, it means an error happened while writing an
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
>  int write_ctree_super(struct btrfs_trans_handle *trans,
>  		      struct btrfs_root *root, int max_mirrors);
>  struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
> +int barrier_all_devices(struct btrfs_fs_info *info);
>  int btrfs_commit_super(struct btrfs_root *root);
>  struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
>  					    u64 bytenr);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index faa7d39..180a3e1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
>  	 * the disk i_size.  There is no need to log the inode
>  	 * at this time.
>  	 */
> -	if (end_pos > isize)
> +	if (end_pos > isize) {
>  		i_size_write(inode, end_pos);
> +		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> +	} else {
> +		set_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> +	}
>  	return 0;
>  }
>  
> @@ -1715,19 +1719,33 @@ out:
>  static void update_time_for_write(struct inode *inode)
>  {
>  	struct timespec now;
> +	int sync_it = 0;
>  
> -	if (IS_NOCMTIME(inode))
> +	if (IS_NOCMTIME(inode)) {
> +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
>  		return;
> +	}
>  
>  	now = current_fs_time(inode->i_sb);
> -	if (!timespec_equal(&inode->i_mtime, &now))
> +	if (!timespec_equal(&inode->i_mtime, &now)) {
>  		inode->i_mtime = now;
> +		sync_it = S_MTIME;
> +	}
>  
> -	if (!timespec_equal(&inode->i_ctime, &now))
> +	if (!timespec_equal(&inode->i_ctime, &now)) {
>  		inode->i_ctime = now;
> +		sync_it |= S_CTIME;
> +	}
>  
> -	if (IS_I_VERSION(inode))
> +	if (IS_I_VERSION(inode)) {
>  		inode_inc_iversion(inode);
> +		sync_it |= S_VERSION;
> +	}
> +
> +	if (!sync_it)
> +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
> +	else
> +		clear_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
>  }
>  
>  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> @@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  		goto out;
>  	}
>  
> +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
> +		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
> +					&BTRFS_I(inode)->runtime_flags) &&
> +		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
> +					&BTRFS_I(inode)->runtime_flags)) {
> +
> +			/* make sure data is on disk and catch error */
> +			if (!full_sync)
> +				ret = filemap_fdatawait_range(inode->i_mapping,
> +							      start, end);
> +
> +			if (!ret && !btrfs_test_opt(root, NOBARRIER)) {
> +				mutex_lock(&root->fs_info->
> +					   fs_devices->device_list_mutex);
> +				ret = barrier_all_devices(root->fs_info);

Calling barrier devices at this point looks very fishy, taking global
device locks to sync one file as well. All files in the filesystem will
pay the penalty for just one nodatacow file that's being synced.

I'm not sure that handling the NOISIZE bit is safe regarding size
extending and sync, ie. if it's properly synchronized with i_mutex from
all contexts.

> +				if (ret)
> +					btrfs_error(root->fs_info, ret,
> +						"errors while submitting device barriers.");	
> +				mutex_unlock(&root->fs_info->
> +					   fs_devices->device_list_mutex);
> +			}
> +			mutex_unlock(&inode->i_mutex);
> +			goto out;
> +		}
> +	}


> +
>  	/*
>  	 * ok we haven't committed the transaction yet, lets do a commit
>  	 */
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 43192e1..e2912c8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1383,6 +1383,7 @@ out_check:
>  
>  		btrfs_release_path(path);
>  		if (cow_start != (u64)-1) {
> +			clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
>  			ret = cow_file_range(inode, locked_page,
>  					     cow_start, found_key.offset - 1,
>  					     page_started, nr_written, 1);
> @@ -1425,6 +1426,7 @@ out_check:
>  						em->start + em->len - 1, 0);
>  			}
>  			type = BTRFS_ORDERED_PREALLOC;
> +			clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
>  		} else {
>  			type = BTRFS_ORDERED_NOCOW;
>  		}
> @@ -1463,6 +1465,7 @@ out_check:
>  	}
>  
>  	if (cow_start != (u64)-1) {
> +		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
>  		ret = cow_file_range(inode, locked_page, cow_start, end,
>  				     page_started, nr_written, 1);
>  		if (ret)
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
  2015-06-17 15:52   ` Liu Bo
@ 2015-06-17 17:01     ` David Sterba
  2015-06-18  2:46       ` Liu Bo
  0 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2015-06-17 17:01 UTC (permalink / raw)
  To: Liu Bo; +Cc: dsterba, linux-btrfs, fdmanana

On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
> On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> > On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > > option to toggle it off.
> > 
> > There's an existing generic iversion/noiversion mount option pair, no
> > need to extra add it to btrfs.
> 
> I know, it doesn't work though.

Sigh, I see, btrfs forces MS_I_VERSION flag,
0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
that there's a standard way to override the defaults.

So the right way is not to do that but this will break everyhing that
relies on that behaviour at the moment. This means to add the exception
to the upper layers, either VFS or 'mount', which is not very likely to
happen.

The generic options do not reach the filesystem specific callbacks, so
we can't check it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
  2015-06-17 17:01     ` David Sterba
@ 2015-06-18  2:46       ` Liu Bo
  2015-06-18 14:38         ` i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION) David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Liu Bo @ 2015-06-18  2:46 UTC (permalink / raw)
  To: dsterba, linux-btrfs, fdmanana

On Wed, Jun 17, 2015 at 07:01:18PM +0200, David Sterba wrote:
> On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
> > On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> > > On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > > > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > > > option to toggle it off.
> > > 
> > > There's an existing generic iversion/noiversion mount option pair, no
> > > need to extra add it to btrfs.
> > 
> > I know, it doesn't work though.
> 
> Sigh, I see, btrfs forces MS_I_VERSION flag,
> 0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
> that there's a standard way to override the defaults.
> 
> So the right way is not to do that but this will break everyhing that
> relies on that behaviour at the moment. This means to add the exception
> to the upper layers, either VFS or 'mount', which is not very likely to
> happen.
> 
> The generic options do not reach the filesystem specific callbacks, so
> we can't check it.

Ext4 also makes its own "i_version" option, so I think we can do the
same thing until more filesystems require to do it in a generic way.

The performance benefit with no_iversion is obvious for fsync
related workloads since we would avoid some expensive log commits.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
  2015-06-17 15:58   ` David Sterba
@ 2015-06-18  3:27     ` Liu Bo
  2015-06-24 18:21       ` David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Liu Bo @ 2015-06-18  3:27 UTC (permalink / raw)
  To: dsterba, linux-btrfs, fdmanana

On Wed, Jun 17, 2015 at 05:58:36PM +0200, David Sterba wrote:
> On Wed, Jun 17, 2015 at 03:54:32PM +0800, Liu Bo wrote:
> > If  we're overwriting an allocated file without changing timestamp
> > and inode version, and the file is with NODATACOW, we don't have any metadata to
> > commit, thus we can just flush the data device cache and go forward.
> > 
> > However, if there's have any change on extents' disk bytenr, inode size,
> > timestamp or inode version, we need to go through the normal btrfs_log_inode
> > path.
> > 
> > Test:
> > ----------------------------
> > 1. sysbench test of
> > "1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite +
> > fsync_on_each_write",
> > 2. loop device associated with tmpfs file
> > 3.
> >   - For btrfs, "-o nodatacow" and "-o noi_version" option
> >   - For ext4 and xfs, no extra mount options
> > ----------------------------
> > 
> > Results:
> > ----------------------------
> > - btrfs:
> > w/o: ~30Mb/sec
> > w:   ~131Mb/sec
> > 
> > - other filesystems: (both don't enable i_version by default)
> > ext4:  203Mb/sec
> > xfs:   212Mb/sec
> > ----------------------------
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> > v2: Catch errors from data writeback and skip barrier if necessary.
> > 
> >  fs/btrfs/btrfs_inode.h |    2 +
> >  fs/btrfs/disk-io.c     |    2 +-
> >  fs/btrfs/disk-io.h     |    1 +
> >  fs/btrfs/file.c        |   54 +++++++++++++++++++++++++++++++++++++++++++----
> >  fs/btrfs/inode.c       |    3 ++
> >  5 files changed, 56 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index de5e4f2..f7b99b6 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -44,6 +44,8 @@
> >  #define BTRFS_INODE_IN_DELALLOC_LIST		9
> >  #define BTRFS_INODE_READDIO_NEED_LOCK		10
> >  #define BTRFS_INODE_HAS_PROPS		        11
> > +#define BTRFS_INODE_NOTIMESTAMP			12
> > +#define BTRFS_INODE_NOISIZE			13
> 
> It's not clear what the flags mean and that they're related to syncing
> under some conditions.

What do you think about BTRFS_ILOG_NOTIMESTAMP and BTRFS_ILOG_NOISIZE? 

> 
> >  /*
> >   * The following 3 bits are meant only for the btree inode.
> >   * When any of them is set, it means an error happened while writing an
> > --- a/fs/btrfs/disk-io.h
> > +++ b/fs/btrfs/disk-io.h
> > @@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
> >  int write_ctree_super(struct btrfs_trans_handle *trans,
> >  		      struct btrfs_root *root, int max_mirrors);
> >  struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
> > +int barrier_all_devices(struct btrfs_fs_info *info);
> >  int btrfs_commit_super(struct btrfs_root *root);
> >  struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
> >  					    u64 bytenr);
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index faa7d39..180a3e1 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
> >  	 * the disk i_size.  There is no need to log the inode
> >  	 * at this time.
> >  	 */
> > -	if (end_pos > isize)
> > +	if (end_pos > isize) {
> >  		i_size_write(inode, end_pos);
> > +		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> > +	} else {
> > +		set_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> > +	}
> >  	return 0;
> >  }
> >  
> > @@ -1715,19 +1719,33 @@ out:
> >  static void update_time_for_write(struct inode *inode)
> >  {
> >  	struct timespec now;
> > +	int sync_it = 0;
> >  
> > -	if (IS_NOCMTIME(inode))
> > +	if (IS_NOCMTIME(inode)) {
> > +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
> >  		return;
> > +	}
> >  
> >  	now = current_fs_time(inode->i_sb);
> > -	if (!timespec_equal(&inode->i_mtime, &now))
> > +	if (!timespec_equal(&inode->i_mtime, &now)) {
> >  		inode->i_mtime = now;
> > +		sync_it = S_MTIME;
> > +	}
> >  
> > -	if (!timespec_equal(&inode->i_ctime, &now))
> > +	if (!timespec_equal(&inode->i_ctime, &now)) {
> >  		inode->i_ctime = now;
> > +		sync_it |= S_CTIME;
> > +	}
> >  
> > -	if (IS_I_VERSION(inode))
> > +	if (IS_I_VERSION(inode)) {
> >  		inode_inc_iversion(inode);
> > +		sync_it |= S_VERSION;
> > +	}
> > +
> > +	if (!sync_it)
> > +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
> > +	else
> > +		clear_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
> >  }
> >  
> >  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> > @@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >  		goto out;
> >  	}
> >  
> > +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
> > +		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
> > +					&BTRFS_I(inode)->runtime_flags) &&
> > +		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
> > +					&BTRFS_I(inode)->runtime_flags)) {
> > +
> > +			/* make sure data is on disk and catch error */
> > +			if (!full_sync)
> > +				ret = filemap_fdatawait_range(inode->i_mapping,
> > +							      start, end);
> > +
> > +			if (!ret && !btrfs_test_opt(root, NOBARRIER)) {
> > +				mutex_lock(&root->fs_info->
> > +					   fs_devices->device_list_mutex);
> > +				ret = barrier_all_devices(root->fs_info);
> 
> Calling barrier devices at this point looks very fishy, taking global
> device locks to sync one file as well. All files in the filesystem will
> pay the penalty for just one nodatacow file that's being synced.

Well, I'm afraid this is necessary as this is a fsync, an expensive operation,
in the normal case, each fsync issues a superblock flush which calls barrier devices.

I was expecting to not take the global device lock but btrfs is able to
manage multiple devices which requires us to do so.

> 
> I'm not sure that handling the NOISIZE bit is safe regarding size
> extending and sync, ie. if it's properly synchronized with i_mutex from
> all contexts.

That's also my concern, but the worst case is that someone clears
NOISIZE bit and we continue on the normal fsync path.

And this NOISIZE bit not only stands for i_size change, but also will be
cleared when we do COW, I'm not sure if we need to use another bit for
the COW change or not.

Thanks,

-liubo

> 
> > +				if (ret)
> > +					btrfs_error(root->fs_info, ret,
> > +						"errors while submitting device barriers.");	
> > +				mutex_unlock(&root->fs_info->
> > +					   fs_devices->device_list_mutex);
> > +			}
> > +			mutex_unlock(&inode->i_mutex);
> > +			goto out;
> > +		}
> > +	}
> 
> 
> > +
> >  	/*
> >  	 * ok we haven't committed the transaction yet, lets do a commit
> >  	 */
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 43192e1..e2912c8 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -1383,6 +1383,7 @@ out_check:
> >  
> >  		btrfs_release_path(path);
> >  		if (cow_start != (u64)-1) {
> > +			clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> >  			ret = cow_file_range(inode, locked_page,
> >  					     cow_start, found_key.offset - 1,
> >  					     page_started, nr_written, 1);
> > @@ -1425,6 +1426,7 @@ out_check:
> >  						em->start + em->len - 1, 0);
> >  			}
> >  			type = BTRFS_ORDERED_PREALLOC;
> > +			clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> >  		} else {
> >  			type = BTRFS_ORDERED_NOCOW;
> >  		}
> > @@ -1463,6 +1465,7 @@ out_check:
> >  	}
> >  
> >  	if (cow_start != (u64)-1) {
> > +		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> >  		ret = cow_file_range(inode, locked_page, cow_start, end,
> >  				     page_started, nr_written, 1);
> >  		if (ret)
> > -- 
> > 1.7.7.6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-18  2:46       ` Liu Bo
@ 2015-06-18 14:38         ` David Sterba
  2015-06-19 11:44             ` Karel Zak
                             ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: David Sterba @ 2015-06-18 14:38 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, fdmanana, kzak, linux-fsdevel, viro

Moving the discussion to fsdevel.

Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
generic 'noiversion' option cannot be used to achieve that. It is
processed before it reaches btrfs superblock callback, where
MS_I_VERSION is forced.

The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
to which I object.

Continued below.

On Thu, Jun 18, 2015 at 10:46:09AM +0800, Liu Bo wrote:
> On Wed, Jun 17, 2015 at 07:01:18PM +0200, David Sterba wrote:
> > On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
> > > On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> > > > On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > > > > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > > > > option to toggle it off.
> > > > 
> > > > There's an existing generic iversion/noiversion mount option pair, no
> > > > need to extra add it to btrfs.
> > > 
> > > I know, it doesn't work though.
> > 
> > Sigh, I see, btrfs forces MS_I_VERSION flag,
> > 0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
> > that there's a standard way to override the defaults.
> > 
> > So the right way is not to do that but this will break everyhing that
> > relies on that behaviour at the moment. This means to add the exception
> > to the upper layers, either VFS or 'mount', which is not very likely to
> > happen.
> > 
> > The generic options do not reach the filesystem specific callbacks, so
> > we can't check it.
> 
> Ext4 also makes its own "i_version" option, so I think we can do the
> same thing until more filesystems require to do it in a generic way.

AFAICS, ext4 had added it's own i_version before iversion was added to
mount:

ext4:

Commit:      25ec56b518257a56d2ff41a941d288e4b5ff9488
Commit date: Mon Jan 28 23:58:27 2008 -0500
Subject:     ext4: Add inode version support in ext4

util-linux:

Commti:      4fa5e73d16828c94234ba0aeafaec2470f79011c
Commit date: Thu Nov 27 12:08:44 2008 +0100
Subject:     mount: add i_version support

I don't know the history, this looks like adding the options was not
coordinated.

> The performance benefit with no_iversion is obvious for fsync
> related workloads since we would avoid some expensive log commits.

It is obviuos, but I'd like to avoid cluttering the mount options
interface further.

xfs also forces I_VERSION if it detects the superblock version 5, so it
could use the same fix that would work for btrfs.

I see two possibilities that pretend to be generic and clean:

1) the filesystem MS_I_* defaults would be exported and processed up the
   mount call stack

2) pass the full mount options to the filesystem (if requested eg. by
   file_system_type::fs_flags bits).

The other ideas contain 'make an exception to ... ' which does not sound
appealing.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-18 14:38         ` i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION) David Sterba
@ 2015-06-19 11:44             ` Karel Zak
  2015-06-22 20:42             ` Dave Chinner
  2015-06-23 16:32           ` Theodore Ts'o
  2 siblings, 0 replies; 28+ messages in thread
From: Karel Zak @ 2015-06-19 11:44 UTC (permalink / raw)
  To: dsterba, Liu Bo, linux-btrfs, fdmanana, linux-fsdevel, viro

On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> AFAICS, ext4 had added it's own i_version before iversion was added to
> mount:

It is not so unusual that some mount option is introduced as
fs-specific and later re-implemented as generic (another example is
MS_LAZYTIME). The mount(8) cares about the generic options only.

> ext4:
> 
> Commit:      25ec56b518257a56d2ff41a941d288e4b5ff9488
> Commit date: Mon Jan 28 23:58:27 2008 -0500
> Subject:     ext4: Add inode version support in ext4
> 
> util-linux:
> 
> Commti:      4fa5e73d16828c94234ba0aeafaec2470f79011c
> Commit date: Thu Nov 27 12:08:44 2008 +0100
> Subject:     mount: add i_version support
> 
> I don't know the history, this looks like adding the options was not
> coordinated.

I don't remember this change, but MS_I_VERSION is part of the kernel
API (include/uapi/linux/fs.h) so I guess it's fine to support it in
mount(8).

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
@ 2015-06-19 11:44             ` Karel Zak
  0 siblings, 0 replies; 28+ messages in thread
From: Karel Zak @ 2015-06-19 11:44 UTC (permalink / raw)
  To: dsterba, Liu Bo, linux-btrfs, fdmanana, linux-fsdevel, viro

On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> AFAICS, ext4 had added it's own i_version before iversion was added to
> mount:

It is not so unusual that some mount option is introduced as
fs-specific and later re-implemented as generic (another example is
MS_LAZYTIME). The mount(8) cares about the generic options only.

> ext4:
> 
> Commit:      25ec56b518257a56d2ff41a941d288e4b5ff9488
> Commit date: Mon Jan 28 23:58:27 2008 -0500
> Subject:     ext4: Add inode version support in ext4
> 
> util-linux:
> 
> Commti:      4fa5e73d16828c94234ba0aeafaec2470f79011c
> Commit date: Thu Nov 27 12:08:44 2008 +0100
> Subject:     mount: add i_version support
> 
> I don't know the history, this looks like adding the options was not
> coordinated.

I don't remember this change, but MS_I_VERSION is part of the kernel
API (include/uapi/linux/fs.h) so I guess it's fine to support it in
mount(8).

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-18 14:38         ` i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION) David Sterba
@ 2015-06-22 20:42             ` Dave Chinner
  2015-06-22 20:42             ` Dave Chinner
  2015-06-23 16:32           ` Theodore Ts'o
  2 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2015-06-22 20:42 UTC (permalink / raw)
  To: dsterba, Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro

On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> Moving the discussion to fsdevel.
> 
> Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> generic 'noiversion' option cannot be used to achieve that. It is
> processed before it reaches btrfs superblock callback, where
> MS_I_VERSION is forced.
> 
> The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> to which I object.

The issue is that you can't overide IS_I_VERSION(inode) because it
looks at the superblock flag, yes?

So perhaps IS_I_VERSION should become an inode flag, set by the
filesystem at inode instantiation time, and hence filesystems can
choose on a per-inode basis if they want I_VERSION behaviour or not.
At that point, the behaviour of MS_I_VERSION becomes irrelevant to
the discussion, doesn't it?

> xfs also forces I_VERSION if it detects the superblock version 5, so it
> could use the same fix that would work for btrfs.

XFS is a special snowflake - it updates the I_VERSION only when an
inode is otherwise modified in a transaction, so turning it off
saves nothing. (And yes, timestamp updates are transactional in
XFS). Hence XFS behaviour is irrelevant to the discussion, because
we aren't ever going to turn it off....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
@ 2015-06-22 20:42             ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2015-06-22 20:42 UTC (permalink / raw)
  To: dsterba, Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro

On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> Moving the discussion to fsdevel.
> 
> Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> generic 'noiversion' option cannot be used to achieve that. It is
> processed before it reaches btrfs superblock callback, where
> MS_I_VERSION is forced.
> 
> The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> to which I object.

The issue is that you can't overide IS_I_VERSION(inode) because it
looks at the superblock flag, yes?

So perhaps IS_I_VERSION should become an inode flag, set by the
filesystem at inode instantiation time, and hence filesystems can
choose on a per-inode basis if they want I_VERSION behaviour or not.
At that point, the behaviour of MS_I_VERSION becomes irrelevant to
the discussion, doesn't it?

> xfs also forces I_VERSION if it detects the superblock version 5, so it
> could use the same fix that would work for btrfs.

XFS is a special snowflake - it updates the I_VERSION only when an
inode is otherwise modified in a transaction, so turning it off
saves nothing. (And yes, timestamp updates are transactional in
XFS). Hence XFS behaviour is irrelevant to the discussion, because
we aren't ever going to turn it off....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-18 14:38         ` i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION) David Sterba
  2015-06-19 11:44             ` Karel Zak
  2015-06-22 20:42             ` Dave Chinner
@ 2015-06-23 16:32           ` Theodore Ts'o
  2015-06-24  8:23             ` Liu Bo
                               ` (2 more replies)
  2 siblings, 3 replies; 28+ messages in thread
From: Theodore Ts'o @ 2015-06-23 16:32 UTC (permalink / raw)
  To: dsterba, Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro,
	linux-nfs, chuck.lever, mingming.cao

On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> Moving the discussion to fsdevel.
> 
> Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> generic 'noiversion' option cannot be used to achieve that. It is
> processed before it reaches btrfs superblock callback, where
> MS_I_VERSION is forced.
> 
> The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> to which I object.

I was talking to Mingming about this on today's ext4 conference call,
and one of the reasons why ext4 turns off i_version update by default
is because it does a real number on our performance as well --- and
furthermore, the only real user of the field from what we can tell is
NFSv4, which not all that many ext4 users actually care about.

This has caused pain for the nfsv4 folks since it means that they need
to tell people to use a special mount option for ext4 if they are
actually using this for nfsv4, and I suspect they won't be all that
eager to hear that btrfs is going to go the same way.

This however got us thinking --- even in if NFSv4 is depending on
i_version, it doesn't actually _look_ at that field all that often.
It's only going to look at it in a response to a client's getattr
call, and that in turn is used to so the client can do its local disk
cache invalidation if anby of the data blocks of the inode has changed.

So what if we have a per-inode flag which "don't update I_VERSION",
which is off by default, but after the i_version has been updated at
least once, is set, so the i_version field won't be updated again ---
at least until something has actually looked at the i_version field,
when the "don't update I_VERSOIN" flag will get cleared again.

So basically, if we know there are no microphones in the forest, we
don't need to make the tree fall.  However, if someone has sampled the
i_version field, then the next time the inode gets updated, we will
update the i_version field so the NFSv4 client can hear the sound of
the tree crashing to the forst floor and so it can invalidate its
local cache of the file.  :-)

This should significantly improve the performance of using the
i_version field if the file system is being exported via NFSv4, and if
NFSv4 is not in use, no one will be looking at the i_version field, so
the performance impact will be very slight, and thus we could enable
i_version updates by default for btrfs and ext4.

And this should make the distribution folks happy, since it will unify
the behavior of all file systems, and make life easier for users who
won't need to set certain magic mount options depending on what file
system they are using and whether they are using NFSv4 or not.

Does this sound reasonable?

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-23 16:32           ` Theodore Ts'o
@ 2015-06-24  8:23             ` Liu Bo
  2015-06-24 18:02             ` David Sterba
  2015-06-25 18:46               ` J. Bruce Fields
  2 siblings, 0 replies; 28+ messages in thread
From: Liu Bo @ 2015-06-24  8:23 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: dsterba, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro,
	linux-nfs, chuck.lever, mingming.cao

On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> > 
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> > 
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
> 
> I was talking to Mingming about this on today's ext4 conference call,
> and one of the reasons why ext4 turns off i_version update by default
> is because it does a real number on our performance as well --- and
> furthermore, the only real user of the field from what we can tell is
> NFSv4, which not all that many ext4 users actually care about.
> 
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.
> 
> This however got us thinking --- even in if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.
> It's only going to look at it in a response to a client's getattr
> call, and that in turn is used to so the client can do its local disk
> cache invalidation if anby of the data blocks of the inode has changed.
> 
> So what if we have a per-inode flag which "don't update I_VERSION",
> which is off by default, but after the i_version has been updated at
> least once, is set, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> when the "don't update I_VERSOIN" flag will get cleared again.
> 
> So basically, if we know there are no microphones in the forest, we
> don't need to make the tree fall.  However, if someone has sampled the
> i_version field, then the next time the inode gets updated, we will
> update the i_version field so the NFSv4 client can hear the sound of
> the tree crashing to the forst floor and so it can invalidate its
> local cache of the file.  :-)
> 
> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.
> 
> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
> 
> Does this sound reasonable?

I agree, this's a promising way to fix the whole thing.

Regarding to client's getattr, I found that inode->i_version is not read by
calling generic_fillattr(), so I'm a bit missing on how we get to
change the flag...

Thanks,

-liubo

> 
> Cheers,
> 
> 						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-22 20:42             ` Dave Chinner
  (?)
@ 2015-06-24 17:28             ` David Sterba
  -1 siblings, 0 replies; 28+ messages in thread
From: David Sterba @ 2015-06-24 17:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro

On Tue, Jun 23, 2015 at 06:42:15AM +1000, Dave Chinner wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> > 
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> > 
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
> 
> The issue is that you can't overide IS_I_VERSION(inode) because it
> looks at the superblock flag, yes?

Effectively, yes.

> So perhaps IS_I_VERSION should become an inode flag, set by the
> filesystem at inode instantiation time, and hence filesystems can
> choose on a per-inode basis if they want I_VERSION behaviour or not.

Sounds good, I like that. Looking at the proposed usecase again, the
performance speedup needs the NODATACOW bit set as well, so setting
one more bit is not a big deal.

Besides, the global 'noi_version' does not have the expected effect
because inode::i_version is incremented unconditionally everywhere
(except 1 call site).  From that perspective I think that the
inode-specific bit is the right approach.

> At that point, the behaviour of MS_I_VERSION becomes irrelevant to
> the discussion, doesn't it?

Agreed.

> > xfs also forces I_VERSION if it detects the superblock version 5, so it
> > could use the same fix that would work for btrfs.
> 
> XFS is a special snowflake - it updates the I_VERSION only when an
> inode is otherwise modified in a transaction, so turning it off
> saves nothing. (And yes, timestamp updates are transactional in
> XFS). Hence XFS behaviour is irrelevant to the discussion, because
> we aren't ever going to turn it off....

Understood.

Thanks for the feedback.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-23 16:32           ` Theodore Ts'o
  2015-06-24  8:23             ` Liu Bo
@ 2015-06-24 18:02             ` David Sterba
  2015-06-24 23:17                 ` Theodore Ts'o
  2015-06-25 18:46               ` J. Bruce Fields
  2 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2015-06-24 18:02 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro,
	linux-nfs, chuck.lever, mingming.cao

On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.

I did not mean to change the default behaviour with respect to nfs, that
would be a regression.

> This however got us thinking --- even in if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.
> It's only going to look at it in a response to a client's getattr
> call, and that in turn is used to so the client can do its local disk
> cache invalidation if anby of the data blocks of the inode has changed.
> 
> So what if we have a per-inode flag which "don't update I_VERSION",
> which is off by default, but after the i_version has been updated at
> least once, is set, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> when the "don't update I_VERSOIN" flag will get cleared again.

This sounds similar to what Dave proposed, a per-inode I_VERSION
attribute that can be changed through chattr. Though the negated meaning
of the flag could be confusing, I had to reread the paragraph again.

> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.

Btrfs default is to update i_version and the uscecase gets fixed by the
per-inode attribute. But from your description above I think that this
might not be enough for ext4. The reason I see are the different
defaults. You want to turn it on by default but not impose any
performance penalty for that, while for our usecase it's sufficient to
selectively turn it off.

I've tried to locate the source of performance drop for ext4 + iversion,
but was not successful so I don't know if the proposed fix is complete.

> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
> 
> Does this sound reasonable?

It does, the unified behaviour wrt NFS is definitely a good thing.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
  2015-06-18  3:27     ` Liu Bo
@ 2015-06-24 18:21       ` David Sterba
  2015-06-25  2:24         ` Liu Bo
  0 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2015-06-24 18:21 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, fdmanana

On Thu, Jun 18, 2015 at 11:27:24AM +0800, Liu Bo wrote:
> > >  #define BTRFS_INODE_IN_DELALLOC_LIST		9
> > >  #define BTRFS_INODE_READDIO_NEED_LOCK		10
> > >  #define BTRFS_INODE_HAS_PROPS		        11
> > > +#define BTRFS_INODE_NOTIMESTAMP			12
> > > +#define BTRFS_INODE_NOISIZE			13
> > 
> > It's not clear what the flags mean and that they're related to syncing
> > under some conditions.
> 
> What do you think about BTRFS_ILOG_NOTIMESTAMP and BTRFS_ILOG_NOISIZE? 

I'd say BTRFS_INODE_FSYNC_NOTIMESTAMP and BTRFS_INODE_FSYNC_NOSIZE

> > > @@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> > >  		goto out;
> > >  	}
> > >  
> > > +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
> > > +		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
> > > +					&BTRFS_I(inode)->runtime_flags) &&
> > > +		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
> > > +					&BTRFS_I(inode)->runtime_flags)) {
> > > +
> > > +			/* make sure data is on disk and catch error */
> > > +			if (!full_sync)
> > > +				ret = filemap_fdatawait_range(inode->i_mapping,
> > > +							      start, end);
> > > +
> > > +			if (!ret && !btrfs_test_opt(root, NOBARRIER)) {
> > > +				mutex_lock(&root->fs_info->
> > > +					   fs_devices->device_list_mutex);
> > > +				ret = barrier_all_devices(root->fs_info);
> > 
> > Calling barrier devices at this point looks very fishy, taking global
> > device locks to sync one file as well. All files in the filesystem will
> > pay the penalty for just one nodatacow file that's being synced.
> 
> Well, I'm afraid this is necessary as this is a fsync, an expensive operation,
> in the normal case, each fsync issues a superblock flush which calls barrier devices.
> 
> I was expecting to not take the global device lock but btrfs is able to
> manage multiple devices which requires us to do so.

I've read the code again and came to the same conclusion, objections
withdrawn.

> > I'm not sure that handling the NOISIZE bit is safe regarding size
> > extending and sync, ie. if it's properly synchronized with i_mutex from
> > all contexts.
> 
> That's also my concern, but the worst case is that someone clears
> NOISIZE bit and we continue on the normal fsync path.

Sounds safe.

> And this NOISIZE bit not only stands for i_size change, but also will be
> cleared when we do COW, I'm not sure if we need to use another bit for
> the COW change or not.

I'm not sure I understand, you mean split the NOISIZE into two bits and
use NOISIZE just for inode size change and the other one for the
cow_file_range case?

Btw, shouln't the NOISIZE bit get cleared inside cow_file_range? Both
calls are in run_delalloc_nocow, this makes sense, but I'm a bit worried
that we could forget to add it somewhere else. I don't think this would
hurt performance, cow_file_range is pretty big.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
@ 2015-06-24 23:17                 ` Theodore Ts'o
  0 siblings, 0 replies; 28+ messages in thread
From: Theodore Ts'o @ 2015-06-24 23:17 UTC (permalink / raw)
  To: dsterba, Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro,
	linux-nfs, chuck.lever, mingming.cao

On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
> 
> This sounds similar to what Dave proposed, a per-inode I_VERSION
> attribute that can be changed through chattr. Though the negated meaning
> of the flag could be confusing, I had to reread the paragraph again.

Dave did not specify an I_VERSION attribute that would be stored on
disk.  Instead he talked about a inode flag that would be set when the
struct inode is created by the file system.

This would allow file systems to permanently configure (on a per-inode
basis) whether or not a particular inode would require a forced
i_version update any time the inode's data or metadata is modified.  I
suppose you could initialized the inode flag from an on-disk
attribute, but that wasn't implied by Dave's proposal, at least as I
understood it.

> > This should significantly improve the performance of using the
> > i_version field if the file system is being exported via NFSv4, and if
> > NFSv4 is not in use, no one will be looking at the i_version field, so
> > the performance impact will be very slight, and thus we could enable
> > i_version updates by default for btrfs and ext4.
> 
> Btrfs default is to update i_version and the uscecase gets fixed by the
> per-inode attribute. But from your description above I think that this
> might not be enough for ext4. The reason I see are the different
> defaults. You want to turn it on by default but not impose any
> performance penalty for that, while for our usecase it's sufficient to
> selectively turn it off.

The problem with selectively turning it off is that the user might
decide for a particular file which is getting lots of updates to turn
off i_version updates --- but then at some subsequent time, that file
is part of a file system which is exported NFSv4, clients will
mysteriously break because i_version was manually turned off.

So this is going to be a potential support nightmare for enterprise
distro help desks --- the user will report that a particular file is
constantly getting corrupted, due to the NFSv4 cache invalidation
getting broken, and it might not be obvious why this is only happening
with this one file, and it's because with btrfs, the i_version update
for particular file was selectively turned off.  I don't think it's a
good design where it is easy for the user to set a flag which breaks
functionality, and in a potentially confusing way, especially when the
net result is potential data corruption.

This is why I would much rather have the default be on, but with
minimal (preferably not measurable) performance overhead.  It's the
best of both worlds.

						- Ted
						

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
@ 2015-06-24 23:17                 ` Theodore Ts'o
  0 siblings, 0 replies; 28+ messages in thread
From: Theodore Ts'o @ 2015-06-24 23:17 UTC (permalink / raw)
  To: dsterba-AlSwsSmVLrQ, Liu Bo, linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	fdmanana-IBi9RG/b67k, kzak-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	chuck.lever-QHcLZuEGTsvQT0dZR+AlfA,
	mingming.cao-QHcLZuEGTsvQT0dZR+AlfA

On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
> 
> This sounds similar to what Dave proposed, a per-inode I_VERSION
> attribute that can be changed through chattr. Though the negated meaning
> of the flag could be confusing, I had to reread the paragraph again.

Dave did not specify an I_VERSION attribute that would be stored on
disk.  Instead he talked about a inode flag that would be set when the
struct inode is created by the file system.

This would allow file systems to permanently configure (on a per-inode
basis) whether or not a particular inode would require a forced
i_version update any time the inode's data or metadata is modified.  I
suppose you could initialized the inode flag from an on-disk
attribute, but that wasn't implied by Dave's proposal, at least as I
understood it.

> > This should significantly improve the performance of using the
> > i_version field if the file system is being exported via NFSv4, and if
> > NFSv4 is not in use, no one will be looking at the i_version field, so
> > the performance impact will be very slight, and thus we could enable
> > i_version updates by default for btrfs and ext4.
> 
> Btrfs default is to update i_version and the uscecase gets fixed by the
> per-inode attribute. But from your description above I think that this
> might not be enough for ext4. The reason I see are the different
> defaults. You want to turn it on by default but not impose any
> performance penalty for that, while for our usecase it's sufficient to
> selectively turn it off.

The problem with selectively turning it off is that the user might
decide for a particular file which is getting lots of updates to turn
off i_version updates --- but then at some subsequent time, that file
is part of a file system which is exported NFSv4, clients will
mysteriously break because i_version was manually turned off.

So this is going to be a potential support nightmare for enterprise
distro help desks --- the user will report that a particular file is
constantly getting corrupted, due to the NFSv4 cache invalidation
getting broken, and it might not be obvious why this is only happening
with this one file, and it's because with btrfs, the i_version update
for particular file was selectively turned off.  I don't think it's a
good design where it is easy for the user to set a flag which breaks
functionality, and in a potentially confusing way, especially when the
net result is potential data corruption.

This is why I would much rather have the default be on, but with
minimal (preferably not measurable) performance overhead.  It's the
best of both worlds.

						- Ted
						
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
@ 2015-06-24 23:59                   ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2015-06-24 23:59 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: dsterba, Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro,
	linux-nfs, chuck.lever, mingming.cao

On Wed, Jun 24, 2015 at 07:17:50PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
> > 
> > This sounds similar to what Dave proposed, a per-inode I_VERSION
> > attribute that can be changed through chattr. Though the negated meaning
> > of the flag could be confusing, I had to reread the paragraph again.
> 
> Dave did not specify an I_VERSION attribute that would be stored on
> disk.  Instead he talked about a inode flag that would be set when the
> struct inode is created by the file system.

Right.

> This would allow file systems to permanently configure (on a per-inode
> basis) whether or not a particular inode would require a forced
> i_version update any time the inode's data or metadata is modified.  I
> suppose you could initialized the inode flag from an on-disk
> attribute, but that wasn't implied by Dave's proposal, at least as I
> understood it.

It enables filesystems to do this. If btrfs want to add an on-disk
flag to turn off I_VERSION on a per-inode basis, or imply it from
some other on-disk flag, then they are welcome to do so and the
above infrastructure change will support it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
@ 2015-06-24 23:59                   ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2015-06-24 23:59 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: dsterba-AlSwsSmVLrQ, Liu Bo, linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	fdmanana-IBi9RG/b67k, kzak-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	chuck.lever-QHcLZuEGTsvQT0dZR+AlfA,
	mingming.cao-QHcLZuEGTsvQT0dZR+AlfA

On Wed, Jun 24, 2015 at 07:17:50PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
> > 
> > This sounds similar to what Dave proposed, a per-inode I_VERSION
> > attribute that can be changed through chattr. Though the negated meaning
> > of the flag could be confusing, I had to reread the paragraph again.
> 
> Dave did not specify an I_VERSION attribute that would be stored on
> disk.  Instead he talked about a inode flag that would be set when the
> struct inode is created by the file system.

Right.

> This would allow file systems to permanently configure (on a per-inode
> basis) whether or not a particular inode would require a forced
> i_version update any time the inode's data or metadata is modified.  I
> suppose you could initialized the inode flag from an on-disk
> attribute, but that wasn't implied by Dave's proposal, at least as I
> understood it.

It enables filesystems to do this. If btrfs want to add an on-disk
flag to turn off I_VERSION on a per-inode basis, or imply it from
some other on-disk flag, then they are welcome to do so and the
above infrastructure change will support it.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
  2015-06-24 18:21       ` David Sterba
@ 2015-06-25  2:24         ` Liu Bo
  2015-06-25 16:10           ` David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Liu Bo @ 2015-06-25  2:24 UTC (permalink / raw)
  To: dsterba, linux-btrfs, fdmanana

On Wed, Jun 24, 2015 at 08:21:30PM +0200, David Sterba wrote:
> On Thu, Jun 18, 2015 at 11:27:24AM +0800, Liu Bo wrote:
> > > >  #define BTRFS_INODE_IN_DELALLOC_LIST		9
> > > >  #define BTRFS_INODE_READDIO_NEED_LOCK		10
> > > >  #define BTRFS_INODE_HAS_PROPS		        11
> > > > +#define BTRFS_INODE_NOTIMESTAMP			12
> > > > +#define BTRFS_INODE_NOISIZE			13
> > > 
> > > It's not clear what the flags mean and that they're related to syncing
> > > under some conditions.
> > 
> > What do you think about BTRFS_ILOG_NOTIMESTAMP and BTRFS_ILOG_NOISIZE? 
> 
> I'd say BTRFS_INODE_FSYNC_NOTIMESTAMP and BTRFS_INODE_FSYNC_NOSIZE

Looks good.

> 
> > > > @@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> > > >  		goto out;
> > > >  	}
> > > >  
> > > > +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
> > > > +		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
> > > > +					&BTRFS_I(inode)->runtime_flags) &&
> > > > +		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
> > > > +					&BTRFS_I(inode)->runtime_flags)) {
> > > > +
> > > > +			/* make sure data is on disk and catch error */
> > > > +			if (!full_sync)
> > > > +				ret = filemap_fdatawait_range(inode->i_mapping,
> > > > +							      start, end);
> > > > +
> > > > +			if (!ret && !btrfs_test_opt(root, NOBARRIER)) {
> > > > +				mutex_lock(&root->fs_info->
> > > > +					   fs_devices->device_list_mutex);
> > > > +				ret = barrier_all_devices(root->fs_info);
> > > 
> > > Calling barrier devices at this point looks very fishy, taking global
> > > device locks to sync one file as well. All files in the filesystem will
> > > pay the penalty for just one nodatacow file that's being synced.
> > 
> > Well, I'm afraid this is necessary as this is a fsync, an expensive operation,
> > in the normal case, each fsync issues a superblock flush which calls barrier devices.
> > 
> > I was expecting to not take the global device lock but btrfs is able to
> > manage multiple devices which requires us to do so.
> 
> I've read the code again and came to the same conclusion, objections
> withdrawn.
> 
> > > I'm not sure that handling the NOISIZE bit is safe regarding size
> > > extending and sync, ie. if it's properly synchronized with i_mutex from
> > > all contexts.
> > 
> > That's also my concern, but the worst case is that someone clears
> > NOISIZE bit and we continue on the normal fsync path.
> 
> Sounds safe.
> 
> > And this NOISIZE bit not only stands for i_size change, but also will be
> > cleared when we do COW, I'm not sure if we need to use another bit for
> > the COW change or not.
> 
> I'm not sure I understand, you mean split the NOISIZE into two bits and
> use NOISIZE just for inode size change and the other one for the
> cow_file_range case?

Yes, for now it has mixed meanings, either changing i_size or doing Cow.
But I think it'd better to leave it mixed if we document it well.

> 
> Btw, shouln't the NOISIZE bit get cleared inside cow_file_range? Both
> calls are in run_delalloc_nocow, this makes sense, but I'm a bit worried
> that we could forget to add it somewhere else. I don't think this would
> hurt performance, cow_file_range is pretty big.

That sounds be better.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
  2015-06-25  2:24         ` Liu Bo
@ 2015-06-25 16:10           ` David Sterba
  0 siblings, 0 replies; 28+ messages in thread
From: David Sterba @ 2015-06-25 16:10 UTC (permalink / raw)
  To: Liu Bo; +Cc: dsterba, linux-btrfs, fdmanana

On Thu, Jun 25, 2015 at 10:24:25AM +0800, Liu Bo wrote:
> > I'm not sure I understand, you mean split the NOISIZE into two bits and
> > use NOISIZE just for inode size change and the other one for the
> > cow_file_range case?
> 
> Yes, for now it has mixed meanings, either changing i_size or doing Cow.
> But I think it'd better to leave it mixed if we document it well.

Agreed, please also comment the new fsync code that actually uses the
nocow + notimestamp + noisize bits.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
@ 2015-06-25 18:46               ` J. Bruce Fields
  0 siblings, 0 replies; 28+ messages in thread
From: J. Bruce Fields @ 2015-06-25 18:46 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: dsterba, Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro,
	linux-nfs, chuck.lever, mingming.cao

On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> > 
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> > 
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
> 
> I was talking to Mingming about this on today's ext4 conference call,
> and one of the reasons why ext4 turns off i_version update by default
> is because it does a real number on our performance as well --- and
> furthermore, the only real user of the field from what we can tell is
> NFSv4, which not all that many ext4 users actually care about.
> 
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.

Yes, thanks for looking into this!

> This however got us thinking --- even in if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.

Most clients will query it on every write.  (I just took a quick look at
the code and I believe the Linux client's requesting it immediately
after every write, except in the O_DIRECT and delegated cases.)

> It's only going to look at it in a response to a client's getattr
> call, and that in turn is used to so the client can do its local disk
> cache invalidation if anby of the data blocks of the inode has changed.
> 
> So what if we have a per-inode flag which "don't update I_VERSION",
> which is off by default, but after the i_version has been updated at
> least once, is set, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> when the "don't update I_VERSOIN" flag will get cleared again.
> 
> So basically, if we know there are no microphones in the forest, we
> don't need to make the tree fall.  However, if someone has sampled the
> i_version field, then the next time the inode gets updated, we will
> update the i_version field so the NFSv4 client can hear the sound of
> the tree crashing to the forst floor and so it can invalidate its
> local cache of the file.  :-)
> 
> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.
> 
> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
> 
> Does this sound reasonable?

Just to make sure I understand, the logic is something like:

	to read the i_version:

		inode->i_version_seen = true;
		return inode->i_version

	to update the i_version:

		/*
		 * If nobody's seen this value of i_version then we can
		 * keep using it, otherwise we need a new one:
		 */
		if (inode->i_version_seen)
			inode->i_version++;
		inode->i_version_seen = false;

Looks OK to me.  As I say I'd expect i_version_seen == true to end up
being the common case in a lot of v4 workloads, so I'm more skeptical of
the claim of a performance improvement in the v4 case.

Could maintaining the new flag be a significant drag in itself?  If not,
then I guess we're not making things any worse there, so fine.

--b.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
@ 2015-06-25 18:46               ` J. Bruce Fields
  0 siblings, 0 replies; 28+ messages in thread
From: J. Bruce Fields @ 2015-06-25 18:46 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: dsterba-AlSwsSmVLrQ, Liu Bo, linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	fdmanana-IBi9RG/b67k, kzak-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	chuck.lever-QHcLZuEGTsvQT0dZR+AlfA,
	mingming.cao-QHcLZuEGTsvQT0dZR+AlfA

On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> > 
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> > 
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
> 
> I was talking to Mingming about this on today's ext4 conference call,
> and one of the reasons why ext4 turns off i_version update by default
> is because it does a real number on our performance as well --- and
> furthermore, the only real user of the field from what we can tell is
> NFSv4, which not all that many ext4 users actually care about.
> 
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.

Yes, thanks for looking into this!

> This however got us thinking --- even in if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.

Most clients will query it on every write.  (I just took a quick look at
the code and I believe the Linux client's requesting it immediately
after every write, except in the O_DIRECT and delegated cases.)

> It's only going to look at it in a response to a client's getattr
> call, and that in turn is used to so the client can do its local disk
> cache invalidation if anby of the data blocks of the inode has changed.
> 
> So what if we have a per-inode flag which "don't update I_VERSION",
> which is off by default, but after the i_version has been updated at
> least once, is set, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> when the "don't update I_VERSOIN" flag will get cleared again.
> 
> So basically, if we know there are no microphones in the forest, we
> don't need to make the tree fall.  However, if someone has sampled the
> i_version field, then the next time the inode gets updated, we will
> update the i_version field so the NFSv4 client can hear the sound of
> the tree crashing to the forst floor and so it can invalidate its
> local cache of the file.  :-)
> 
> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.
> 
> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
> 
> Does this sound reasonable?

Just to make sure I understand, the logic is something like:

	to read the i_version:

		inode->i_version_seen = true;
		return inode->i_version

	to update the i_version:

		/*
		 * If nobody's seen this value of i_version then we can
		 * keep using it, otherwise we need a new one:
		 */
		if (inode->i_version_seen)
			inode->i_version++;
		inode->i_version_seen = false;

Looks OK to me.  As I say I'd expect i_version_seen == true to end up
being the common case in a lot of v4 workloads, so I'm more skeptical of
the claim of a performance improvement in the v4 case.

Could maintaining the new flag be a significant drag in itself?  If not,
then I guess we're not making things any worse there, so fine.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-25 18:46               ` J. Bruce Fields
  (?)
@ 2015-06-25 22:12               ` Theodore Ts'o
  2015-06-26 13:32                 ` J. Bruce Fields
  -1 siblings, 1 reply; 28+ messages in thread
From: Theodore Ts'o @ 2015-06-25 22:12 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: dsterba, Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro,
	linux-nfs, chuck.lever, mingming.cao

On Thu, Jun 25, 2015 at 02:46:44PM -0400, J. Bruce Fields wrote:
> > Does this sound reasonable?
> 
> Just to make sure I understand, the logic is something like:
> 
> 	to read the i_version:
> 
> 		inode->i_version_seen = true;
> 		return inode->i_version
> 
> 	to update the i_version:
> 
> 		/*
> 		 * If nobody's seen this value of i_version then we can
> 		 * keep using it, otherwise we need a new one:
> 		 */
> 		if (inode->i_version_seen)
> 			inode->i_version++;
> 		inode->i_version_seen = false;

Yep, that's what I was proposing.

> Looks OK to me.  As I say I'd expect i_version_seen == true to end up
> being the common case in a lot of v4 workloads, so I'm more skeptical of
> the claim of a performance improvement in the v4 case.

Well, so long as we require i_version to be committed to disk on every
single disk write, we're going to be trading off:

	* client-side performance of the advanced NFSv4 cacheing for reads
	* server-side performance for writes
	* data robustness in case of the server crashing and the client-side cache
	  getting out of sync with the server after the crash

I don't see any way around that.  (So for example, with lazy mtime
updates we wouldn't be updating the inode after every single
non-allocating write; enabling i_version updates will trash that
optimization.)

I just want to reduce to a bare minimum the performance hit in the
case where NFSv4 exports are not being used (since that is true in a
very *large* number of ext4 deployments --- i.e., every single Android
handset using ext4 :-), such that we can leave i_version updates
turned on by default.

> Could maintaining the new flag be a significant drag in itself?  If not,
> then I guess we're not making things any worse there, so fine.

I don't think so; it's a bit in the in-memory inode, so I don't think
that should be an issue.

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
  2015-06-25 22:12               ` Theodore Ts'o
@ 2015-06-26 13:32                 ` J. Bruce Fields
  0 siblings, 0 replies; 28+ messages in thread
From: J. Bruce Fields @ 2015-06-26 13:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: dsterba, Liu Bo, linux-btrfs, fdmanana, kzak, linux-fsdevel, viro,
	linux-nfs, chuck.lever, mingming.cao

On Thu, Jun 25, 2015 at 06:12:57PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 25, 2015 at 02:46:44PM -0400, J. Bruce Fields wrote:
> > Looks OK to me.  As I say I'd expect i_version_seen == true to end up
> > being the common case in a lot of v4 workloads, so I'm more skeptical of
> > the claim of a performance improvement in the v4 case.
> 
> Well, so long as we require i_version to be committed to disk on every
> single disk write, we're going to be trading off:
> 
> 	* client-side performance of the advanced NFSv4 cacheing for reads
> 	* server-side performance for writes
> 	* data robustness in case of the server crashing and the client-side cache
> 	  getting out of sync with the server after the crash
> 
> I don't see any way around that.  (So for example, with lazy mtime
> updates we wouldn't be updating the inode after every single
> non-allocating write; enabling i_version updates will trash that
> optimization.)
> 
> I just want to reduce to a bare minimum the performance hit in the
> case where NFSv4 exports are not being used (since that is true in a
> very *large* number of ext4 deployments --- i.e., every single Android
> handset using ext4 :-), such that we can leave i_version updates
> turned on by default.

Definitely understood.  I think it's a good idea.

> > Could maintaining the new flag be a significant drag in itself?  If not,
> > then I guess we're not making things any worse there, so fine.
> 
> I don't think so; it's a bit in the in-memory inode, so I don't think
> that should be an issue.

OK!

--b.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2015-06-26 13:32 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-17  7:54 [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION Liu Bo
2015-06-17  7:54 ` [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file Liu Bo
2015-06-17 15:58   ` David Sterba
2015-06-18  3:27     ` Liu Bo
2015-06-24 18:21       ` David Sterba
2015-06-25  2:24         ` Liu Bo
2015-06-25 16:10           ` David Sterba
2015-06-17 15:33 ` [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION David Sterba
2015-06-17 15:52   ` Liu Bo
2015-06-17 17:01     ` David Sterba
2015-06-18  2:46       ` Liu Bo
2015-06-18 14:38         ` i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION) David Sterba
2015-06-19 11:44           ` Karel Zak
2015-06-19 11:44             ` Karel Zak
2015-06-22 20:42           ` Dave Chinner
2015-06-22 20:42             ` Dave Chinner
2015-06-24 17:28             ` David Sterba
2015-06-23 16:32           ` Theodore Ts'o
2015-06-24  8:23             ` Liu Bo
2015-06-24 18:02             ` David Sterba
2015-06-24 23:17               ` Theodore Ts'o
2015-06-24 23:17                 ` Theodore Ts'o
2015-06-24 23:59                 ` Dave Chinner
2015-06-24 23:59                   ` Dave Chinner
2015-06-25 18:46             ` J. Bruce Fields
2015-06-25 18:46               ` J. Bruce Fields
2015-06-25 22:12               ` Theodore Ts'o
2015-06-26 13:32                 ` J. Bruce Fields

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.