Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* t0032 fails on NFS mounts
@ 2024-03-12 15:10 Chuck Lever
  2024-03-12 17:11 ` rsbecker
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Chuck Lever @ 2024-03-12 15:10 UTC (permalink / raw
  To: git; +Cc: chuck.lever, linux-nfs

Hi-

I've checked out "maint".

Unit test t0032 fails when run on an NFS mount:

[vagrant@cel t]$ ./t0032-reftable-unittest.sh 
not ok 1 - unittests
#	
#		TMPDIR=$(pwd) && export TMPDIR &&
#		test-tool reftable
#	
# failed 1 among 1 test(s)
1..1
[vagrant@cel t]$

But not on XFS:

[vagrant@cel t]$ ./t0032-reftable-unittest.sh 
ok 1 - unittests
# passed all 1 test(s)
1..1
[vagrant@cel t]$ cd ..
[vagrant@cel git]$ ./git --version
git version 2.44.0
[vagrant@cel git]$

v2.43.2 seems to work OK.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: t0032 fails on NFS mounts
  2024-03-12 15:10 t0032 fails on NFS mounts Chuck Lever
@ 2024-03-12 17:11 ` rsbecker
  2024-03-21 15:50   ` Patrick Steinhardt
  2024-03-13  7:20 ` Jeff King
  2024-03-21 15:39 ` [PATCH] reftable: fix tests being broken by NFS' delete-after-close semantics Patrick Steinhardt
  2 siblings, 1 reply; 10+ messages in thread
From: rsbecker @ 2024-03-12 17:11 UTC (permalink / raw
  To: 'Chuck Lever', git; +Cc: linux-nfs

On Tuesday, March 12, 2024 11:10 AM, Chuck Lever wrote:
>I've checked out "maint".
>
>Unit test t0032 fails when run on an NFS mount:
>
>[vagrant@cel t]$ ./t0032-reftable-unittest.sh not ok 1 - unittests
>#
>#		TMPDIR=$(pwd) && export TMPDIR &&
>#		test-tool reftable
>#
># failed 1 among 1 test(s)
>1..1
>[vagrant@cel t]$
>
>But not on XFS:
>
>[vagrant@cel t]$ ./t0032-reftable-unittest.sh ok 1 - unittests # passed all
1 test(s)
>1..1
>[vagrant@cel t]$ cd ..
>[vagrant@cel git]$ ./git --version
>git version 2.44.0
>[vagrant@cel git]$
>
>v2.43.2 seems to work OK.

I have seen a similar effect on a standard POSIX file system (NonStop) when
run in a special platform container (Pathway). It does not happen for me
when run from bash directly. This may be something other than an NFS effect,
or unrelated to what I observed. I'll be monitoring this sub-test for
repeats.
--Randall


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: t0032 fails on NFS mounts
  2024-03-12 15:10 t0032 fails on NFS mounts Chuck Lever
  2024-03-12 17:11 ` rsbecker
@ 2024-03-13  7:20 ` Jeff King
  2024-03-15  0:35   ` Patrick Steinhardt
  2024-03-21 15:14   ` Patrick Steinhardt
  2024-03-21 15:39 ` [PATCH] reftable: fix tests being broken by NFS' delete-after-close semantics Patrick Steinhardt
  2 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2024-03-13  7:20 UTC (permalink / raw
  To: Chuck Lever; +Cc: Patrick Steinhardt, git, linux-nfs

+cc Patrick for reftable

On Tue, Mar 12, 2024 at 11:10:29AM -0400, Chuck Lever wrote:

> Unit test t0032 fails when run on an NFS mount:
> 
> [vagrant@cel t]$ ./t0032-reftable-unittest.sh 
> not ok 1 - unittests
> #	
> #		TMPDIR=$(pwd) && export TMPDIR &&
> #		test-tool reftable
> #	
> # failed 1 among 1 test(s)
> 1..1

The output for this test script is particularly unhelpful because it's
not using our test harness at all, but just running a bunch of internal
tests using a single program.

Running with "-v" should give more details about what's failing.

I set up a basic loopback server like:

  mkdir /mnt/{server,client}
  exportfs -o rw,sync 127.0.0.1:/mnt/server
  mount -t nfs 127.0.0.1:/mnt/server /mnt/client

and then ran:

  ./t0032-reftable-unittest.sh --root=/mnt/client -v

Looks like it fails at:

  running test_reftable_stack_compaction_concurrent_clean
  reftable/stack_test.c: 1063: failed assertion count_dir_entries(dir) == 2
  Aborted

> v2.43.2 seems to work OK.

For me, too. Bisecting shows the problem appearing in 4f36b8597c
(reftable/stack: fix race in up-to-date check, 2024-01-18).

-Peff

PS That test seems to run ~20x slower on NFS versus directly on ext4.
   I'd expect a little overhead, but that's quite a bit.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: t0032 fails on NFS mounts
  2024-03-13  7:20 ` Jeff King
@ 2024-03-15  0:35   ` Patrick Steinhardt
  2024-03-15 15:41     ` Junio C Hamano
  2024-03-21 15:14   ` Patrick Steinhardt
  1 sibling, 1 reply; 10+ messages in thread
From: Patrick Steinhardt @ 2024-03-15  0:35 UTC (permalink / raw
  To: Jeff King; +Cc: Chuck Lever, git, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1620 bytes --]

On Wed, Mar 13, 2024 at 03:20:52AM -0400, Jeff King wrote:
> +cc Patrick for reftable
> 
> On Tue, Mar 12, 2024 at 11:10:29AM -0400, Chuck Lever wrote:
> 
> > Unit test t0032 fails when run on an NFS mount:
> > 
> > [vagrant@cel t]$ ./t0032-reftable-unittest.sh 
> > not ok 1 - unittests
> > #	
> > #		TMPDIR=$(pwd) && export TMPDIR &&
> > #		test-tool reftable
> > #	
> > # failed 1 among 1 test(s)
> > 1..1
> 
> The output for this test script is particularly unhelpful because it's
> not using our test harness at all, but just running a bunch of internal
> tests using a single program.
> 
> Running with "-v" should give more details about what's failing.
> 
> I set up a basic loopback server like:
> 
>   mkdir /mnt/{server,client}
>   exportfs -o rw,sync 127.0.0.1:/mnt/server
>   mount -t nfs 127.0.0.1:/mnt/server /mnt/client
> 
> and then ran:
> 
>   ./t0032-reftable-unittest.sh --root=/mnt/client -v
> 
> Looks like it fails at:
> 
>   running test_reftable_stack_compaction_concurrent_clean
>   reftable/stack_test.c: 1063: failed assertion count_dir_entries(dir) == 2
>   Aborted
> 
> > v2.43.2 seems to work OK.
> 
> For me, too. Bisecting shows the problem appearing in 4f36b8597c
> (reftable/stack: fix race in up-to-date check, 2024-01-18).
> 
> -Peff
> 
> PS That test seems to run ~20x slower on NFS versus directly on ext4.
>    I'd expect a little overhead, but that's quite a bit.

Unless somebody else beats me to it I'll investigate this next week when
I'm back in the office. Thanks for the report and for digging into this
issue!

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: t0032 fails on NFS mounts
  2024-03-15  0:35   ` Patrick Steinhardt
@ 2024-03-15 15:41     ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2024-03-15 15:41 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: Jeff King, Chuck Lever, git, linux-nfs

Patrick Steinhardt <ps@pks.im> writes:

> Unless somebody else beats me to it I'll investigate this next week when
> I'm back in the office. Thanks for the report and for digging into this
> issue!

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: t0032 fails on NFS mounts
  2024-03-13  7:20 ` Jeff King
  2024-03-15  0:35   ` Patrick Steinhardt
@ 2024-03-21 15:14   ` Patrick Steinhardt
  1 sibling, 0 replies; 10+ messages in thread
From: Patrick Steinhardt @ 2024-03-21 15:14 UTC (permalink / raw
  To: Jeff King; +Cc: Chuck Lever, git, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2590 bytes --]

On Wed, Mar 13, 2024 at 03:20:52AM -0400, Jeff King wrote:
> +cc Patrick for reftable
> 
> On Tue, Mar 12, 2024 at 11:10:29AM -0400, Chuck Lever wrote:
> 
> > Unit test t0032 fails when run on an NFS mount:
> > 
> > [vagrant@cel t]$ ./t0032-reftable-unittest.sh 
> > not ok 1 - unittests
> > #	
> > #		TMPDIR=$(pwd) && export TMPDIR &&
> > #		test-tool reftable
> > #	
> > # failed 1 among 1 test(s)
> > 1..1
> 
> The output for this test script is particularly unhelpful because it's
> not using our test harness at all, but just running a bunch of internal
> tests using a single program.
> 
> Running with "-v" should give more details about what's failing.
> 
> I set up a basic loopback server like:
> 
>   mkdir /mnt/{server,client}
>   exportfs -o rw,sync 127.0.0.1:/mnt/server
>   mount -t nfs 127.0.0.1:/mnt/server /mnt/client
> 
> and then ran:
> 
>   ./t0032-reftable-unittest.sh --root=/mnt/client -v
> 
> Looks like it fails at:
> 
>   running test_reftable_stack_compaction_concurrent_clean
>   reftable/stack_test.c: 1063: failed assertion count_dir_entries(dir) == 2
>   Aborted
> 
> > v2.43.2 seems to work OK.
> 
> For me, too. Bisecting shows the problem appearing in 4f36b8597c
> (reftable/stack: fix race in up-to-date check, 2024-01-18).

I think this is actually benign. I set a breakpoint in the respective
test right before double-checking our conditions, and curiously I got
back the following list of files:

    ./stack_test-1027.QJBpnd
    ./stack_test-1027.QJBpnd/0x000000000001-0x000000000003-dad7ac80.ref
    ./stack_test-1027.QJBpnd/.nfs000000000001729f00001e11
    ./stack_test-1027.QJBpnd/tables.list

Notice the ".nfs*" thing? This is a temporary file managed by the NFS
client that maintains delete-on-close behaviour because we have unlinked
the file while it was still open [1]. But of course we count that file
when executing `count_dir_entries()`, and thus we arrive at an
unexpected number of files.

I will send a patch to fix the test.

> PS That test seems to run ~20x slower on NFS versus directly on ext4.
>    I'd expect a little overhead, but that's quite a bit.

I'm not all that surprised here given that the reftable library is quite
prone to stat(3P)ing the "tables.list" file, and potentially re-reading
it. I kind of suspect that this is what's going on. An alternative
explanation might be that mmap'ing over NFS is really slow.

Anyway, I will have a deeper look at this and see where we spend all the
time.

Patrick

[1]: https://nfs.sourceforge.net/#faq_d2

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] reftable: fix tests being broken by NFS' delete-after-close semantics
  2024-03-12 15:10 t0032 fails on NFS mounts Chuck Lever
  2024-03-12 17:11 ` rsbecker
  2024-03-13  7:20 ` Jeff King
@ 2024-03-21 15:39 ` Patrick Steinhardt
  2024-03-21 17:35   ` Junio C Hamano
  2024-03-22 15:15   ` Toon Claes
  2 siblings, 2 replies; 10+ messages in thread
From: Patrick Steinhardt @ 2024-03-21 15:39 UTC (permalink / raw
  To: git; +Cc: Chuck Lever, rsbecker, Jeff King, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 4296 bytes --]

It was reported that the reftable unit tests in t0032 fail with the
following assertion when running on top of NFS:

    running test_reftable_stack_compaction_concurrent_clean
    reftable/stack_test.c: 1063: failed assertion count_dir_entries(dir) == 2
    Aborted

Setting a breakpoint immediately before the assertion in fact shows the
following list of files:

    ./stack_test-1027.QJBpnd
    ./stack_test-1027.QJBpnd/0x000000000001-0x000000000003-dad7ac80.ref
    ./stack_test-1027.QJBpnd/.nfs000000000001729f00001e11
    ./stack_test-1027.QJBpnd/tables.list

Note the weird ".nfs*" file? This file is maintained by NFS clients in
order to emulate delete-after-last-close semantics that we rely on in
the reftable code [1]. Instead of unlinking the file right away and
keeping it open in the client, the NFS client will rename it to ".nfs*"
and then delete that temporary file when the last reference to it gets
dropped. Quoting the NFS FAQ:

    > D2. What is a "silly rename"? Why do these .nfsXXXXX files keep
    > showing up?
    >
    > A. Unix applications often open a scratch file and then unlink it.
    > They do this so that the file is not visible in the file system name
    > space to any other applications, and so that the system will
    > automatically clean up (delete) the file when the application exits.
    > This is known as "delete on last close", and is a tradition among
    > Unix applications.
    >
    > Because of the design of the NFS protocol, there is no way for a
    > file to be deleted from the name space but still remain in use by an
    > application. Thus NFS clients have to emulate this using what
    > already exists in the protocol. If an open file is unlinked, an NFS
    > client renames it to a special name that looks like ".nfsXXXXX".
    > This "hides" the file while it remains in use. This is known as a
    > "silly rename." Note that NFS servers have nothing to do with this
    > behavior.

This of course throws off the assertion that we got exactly two files in
that directory.

The test in question triggers this behaviour by holding two open file
descriptors to the "tables.list" file. One of the references is because
we are about to append to the stack, whereas the other reference is
because we want to compact it. As the compaction has just finished we
already rewrote "tables.list" to point to the new contents, but the
other file descriptor pointing to the old version is still open. Thus we
trigger the delete-after-last-close emulation.

Furthermore, it was reported that this behaviour only triggers with
4f36b8597c (reftable/stack: fix race in up-to-date check, 2024-01-18).
This is expected as well because it is the first point in time where we
actually keep the "tables.list" file descriptor open for the stat cache.

Fix this bug by skipping over any files that start with a leading dot
when counting files. While we could explicitly check for a prefix of
".nfs", other network file systems like SMB for example do the same
trickery but with a ".smb" prefix. In any case though, this loosening of
the assertion should be fine given that the reftable library would never
write files with leading dots by itself.

[1]: https://nfs.sourceforge.net/#faq_d2

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/stack_test.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/reftable/stack_test.c b/reftable/stack_test.c
index 509f486623..621578fda8 100644
--- a/reftable/stack_test.c
+++ b/reftable/stack_test.c
@@ -38,7 +38,17 @@ static int count_dir_entries(const char *dirname)
 		return 0;
 
 	while ((d = readdir(dir))) {
-		if (!strcmp(d->d_name, "..") || !strcmp(d->d_name, "."))
+		/*
+		 * Besides skipping over "." and "..", we also need to
+		 * skip over other files that have a leading ".". This
+		 * is due to behaviour of NFS, which will rename files
+		 * to ".nfs*" to emulate delete-on-last-close.
+		 *
+		 * In any case this should be fine as the reftable
+		 * library will never write files with leading dots
+		 * anyway.
+		 */
+		if (starts_with(d->d_name, "."))
 			continue;
 		len++;
 	}
-- 
2.44.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: t0032 fails on NFS mounts
  2024-03-12 17:11 ` rsbecker
@ 2024-03-21 15:50   ` Patrick Steinhardt
  0 siblings, 0 replies; 10+ messages in thread
From: Patrick Steinhardt @ 2024-03-21 15:50 UTC (permalink / raw
  To: rsbecker; +Cc: 'Chuck Lever', git, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1236 bytes --]

On Tue, Mar 12, 2024 at 01:11:15PM -0400, rsbecker@nexbridge.com wrote:
> On Tuesday, March 12, 2024 11:10 AM, Chuck Lever wrote:
> >I've checked out "maint".
> >
> >Unit test t0032 fails when run on an NFS mount:
> >
> >[vagrant@cel t]$ ./t0032-reftable-unittest.sh not ok 1 - unittests
> >#
> >#		TMPDIR=$(pwd) && export TMPDIR &&
> >#		test-tool reftable
> >#
> ># failed 1 among 1 test(s)
> >1..1
> >[vagrant@cel t]$
> >
> >But not on XFS:
> >
> >[vagrant@cel t]$ ./t0032-reftable-unittest.sh ok 1 - unittests # passed all
> 1 test(s)
> >1..1
> >[vagrant@cel t]$ cd ..
> >[vagrant@cel git]$ ./git --version
> >git version 2.44.0
> >[vagrant@cel git]$
> >
> >v2.43.2 seems to work OK.
> 
> I have seen a similar effect on a standard POSIX file system (NonStop) when
> run in a special platform container (Pathway). It does not happen for me
> when run from bash directly. This may be something other than an NFS effect,
> or unrelated to what I observed. I'll be monitoring this sub-test for
> repeats.

Interesting. Do you know at what point in time this happened? Was it for
one of the releases or any of the in-between states? Also, does Pathway
mess with the stat(3P) info somehow?

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] reftable: fix tests being broken by NFS' delete-after-close semantics
  2024-03-21 15:39 ` [PATCH] reftable: fix tests being broken by NFS' delete-after-close semantics Patrick Steinhardt
@ 2024-03-21 17:35   ` Junio C Hamano
  2024-03-22 15:15   ` Toon Claes
  1 sibling, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2024-03-21 17:35 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git, Chuck Lever, rsbecker, Jeff King, linux-nfs

Patrick Steinhardt <ps@pks.im> writes:

>  	while ((d = readdir(dir))) {
> -		if (!strcmp(d->d_name, "..") || !strcmp(d->d_name, "."))
> +		/*
> +		 * Besides skipping over "." and "..", we also need to
> +		 * skip over other files that have a leading ".". This
> +		 * is due to behaviour of NFS, which will rename files
> +		 * to ".nfs*" to emulate delete-on-last-close.
> +		 *
> +		 * In any case this should be fine as the reftable
> +		 * library will never write files with leading dots
> +		 * anyway.
> +		 */
> +		if (starts_with(d->d_name, "."))
>  			continue;

Sounds good.  We should count what we positively consider what we
wrote, not random cruft somebody else might have dropped in the
directory, and this is a good first thing to do.

Will queue.  Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] reftable: fix tests being broken by NFS' delete-after-close semantics
  2024-03-21 15:39 ` [PATCH] reftable: fix tests being broken by NFS' delete-after-close semantics Patrick Steinhardt
  2024-03-21 17:35   ` Junio C Hamano
@ 2024-03-22 15:15   ` Toon Claes
  1 sibling, 0 replies; 10+ messages in thread
From: Toon Claes @ 2024-03-22 15:15 UTC (permalink / raw
  To: Patrick Steinhardt, git; +Cc: Chuck Lever, rsbecker, Jeff King, linux-nfs

Patrick Steinhardt <ps@pks.im> writes:

> Fix this bug by skipping over any files that start with a leading dot
> when counting files. While we could explicitly check for a prefix of
> ".nfs", other network file systems like SMB for example do the same
> trickery but with a ".smb" prefix. In any case though, this loosening of
> the assertion should be fine given that the reftable library would never
> write files with leading dots by itself.

I'm fully supportive of this, as this will also fix any issues possibly
caused by .DS_Store files created by Finder on macOS, although it's very
unlikely they will be created in these tests.

--
Toon

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-03-22 15:15 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-12 15:10 t0032 fails on NFS mounts Chuck Lever
2024-03-12 17:11 ` rsbecker
2024-03-21 15:50   ` Patrick Steinhardt
2024-03-13  7:20 ` Jeff King
2024-03-15  0:35   ` Patrick Steinhardt
2024-03-15 15:41     ` Junio C Hamano
2024-03-21 15:14   ` Patrick Steinhardt
2024-03-21 15:39 ` [PATCH] reftable: fix tests being broken by NFS' delete-after-close semantics Patrick Steinhardt
2024-03-21 17:35   ` Junio C Hamano
2024-03-22 15:15   ` Toon Claes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).