Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] un-breaking osx-gcc ci job
@ 2024-05-09 16:22 Jeff King
  2024-05-09 16:23 ` [PATCH 1/3] ci: drop mention of BREW_INSTALL_PACKAGES variable Jeff King
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Jeff King @ 2024-05-09 16:22 UTC (permalink / raw
  To: git

The osx-gcc job seems to break reliably for me, which started in the
last week or so (since the last time I actually triggered CI). The third
patch fixes it, but I noticed while investigating that the job is not
even running gcc at all! That's fixed in the second patch. And the first
one is just a cleanup I found along the way.

I'm not sure that this job carries a huge amount of value over the
osx-clang one, so we might consider just ditching it. But in the
meantime, this should get things passing again.

  [1/3]: ci: drop mention of BREW_INSTALL_PACKAGES variable
  [2/3]: ci: avoid bare "gcc" for osx-gcc job
  [3/3]: ci: stop installing "gcc-13" for osx-gcc

 .github/workflows/main.yml | 3 +--
 ci/install-dependencies.sh | 2 --
 2 files changed, 1 insertion(+), 4 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/3] ci: drop mention of BREW_INSTALL_PACKAGES variable
  2024-05-09 16:22 [PATCH 0/3] un-breaking osx-gcc ci job Jeff King
@ 2024-05-09 16:23 ` Jeff King
  2024-05-09 16:24 ` [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job Jeff King
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-05-09 16:23 UTC (permalink / raw
  To: git

The last user of this variable went away in 4a6e4b9602 (CI: remove
Travis CI support, 2021-11-23), so it's doing nothing except making it
more confusing to find out which packages _are_ installed.

Signed-off-by: Jeff King <peff@peff.net>
---
 ci/install-dependencies.sh | 2 --
 1 file changed, 2 deletions(-)

diff --git a/ci/install-dependencies.sh b/ci/install-dependencies.sh
index c196e56762..2e7688ae8b 100755
--- a/ci/install-dependencies.sh
+++ b/ci/install-dependencies.sh
@@ -69,8 +69,6 @@ macos-*)
 	export HOMEBREW_NO_AUTO_UPDATE=1 HOMEBREW_NO_INSTALL_CLEANUP=1
 	# Uncomment this if you want to run perf tests:
 	# brew install gnu-time
-	test -z "$BREW_INSTALL_PACKAGES" ||
-	brew install $BREW_INSTALL_PACKAGES
 	brew link --force gettext
 
 	mkdir -p "$CUSTOM_PATH"
-- 
2.45.0.414.g4009e73179


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-09 16:22 [PATCH 0/3] un-breaking osx-gcc ci job Jeff King
  2024-05-09 16:23 ` [PATCH 1/3] ci: drop mention of BREW_INSTALL_PACKAGES variable Jeff King
@ 2024-05-09 16:24 ` Jeff King
  2024-05-10  7:00   ` Patrick Steinhardt
  2024-05-10 20:32   ` Kyle Lippincott
  2024-05-09 16:25 ` [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc Jeff King
  2024-05-09 16:52 ` [PATCH 0/3] un-breaking osx-gcc ci job Junio C Hamano
  3 siblings, 2 replies; 31+ messages in thread
From: Jeff King @ 2024-05-09 16:24 UTC (permalink / raw
  To: git

On macOS, a bare "gcc" (without a version) will invoke a wrapper for
clang, not actual gcc. Even when gcc is installed via homebrew, that
only provides version-specific links in /usr/local/bin (like "gcc-13"),
and never a version-agnostic "gcc" wrapper.

As far as I can tell, this has been the case for a long time, and this
osx-gcc job has largely been doing nothing. We can point it at "gcc-13",
which will pick up the homebrew-installed version.

The fix here is specific to the github workflow file, as the gitlab one
does not have a matching job.

It's a little unfortunate that we cannot just ask for the latest version
of gcc which homebrew provides, but as far as I can tell there is no
easy alias (you'd have to find the highest number gcc-* in
/usr/local/bin yourself).

Signed-off-by: Jeff King <peff@peff.net>
---
We'll eventually have to bump "gcc-13" to "gcc-14" here, and so on. I
don't think we ever cared about gcc-13 in particular; it's just that
older versions of the runner image had some ancient version which we
wanted to avoid.

 .github/workflows/main.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 5838986895..5f92a50271 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -284,7 +284,7 @@ jobs:
             cc: clang
             pool: macos-13
           - jobname: osx-gcc
-            cc: gcc
+            cc: gcc-13
             cc_package: gcc-13
             pool: macos-13
           - jobname: linux-gcc-default
-- 
2.45.0.414.g4009e73179


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-09 16:22 [PATCH 0/3] un-breaking osx-gcc ci job Jeff King
  2024-05-09 16:23 ` [PATCH 1/3] ci: drop mention of BREW_INSTALL_PACKAGES variable Jeff King
  2024-05-09 16:24 ` [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job Jeff King
@ 2024-05-09 16:25 ` Jeff King
  2024-05-10  7:00   ` Patrick Steinhardt
  2024-05-09 16:52 ` [PATCH 0/3] un-breaking osx-gcc ci job Junio C Hamano
  3 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-05-09 16:25 UTC (permalink / raw
  To: git

Our osx-gcc job explicitly asks to install gcc-13. But since the GitHub
runner image already comes with gcc-13 installed, this is mostly doing
nothing (or in some cases it may install an incremental update over the
runner image). But worse, it recently started causing errors like:

    ==> Fetching gcc@13
    ==> Downloading https://ghcr.io/v2/homebrew/core/gcc/13/blobs/sha256:fb2403d97e2ce67eb441b54557cfb61980830f3ba26d4c5a1fe5ecd0c9730d1a
    ==> Pouring gcc@13--13.2.0.ventura.bottle.tar.gz
    Error: The `brew link` step did not complete successfully
    The formula built, but is not symlinked into /usr/local
    Could not symlink bin/c++-13
    Target /usr/local/bin/c++-13
    is a symlink belonging to gcc. You can unlink it:
      brew unlink gcc

which cause the whole CI job to bail.

I didn't track down the root cause, but I suspect it may be related to
homebrew recently switching the "gcc" default to gcc-14. And it may even
be fixed when a new runner image is released. But if we don't need to
run brew at all, it's one less thing for us to worry about.

Signed-off-by: Jeff King <peff@peff.net>
---
I don't think this was due to anything on our side. I tried to re-run
versions of git.git which had previously passed and ran into the same
issue.

I'd like to report that this let me get a successful CI run, but I'm
running into the thing where osx jobs seem to randomly hang sometimes
and hit the 6-hour timeout. But I did confirm that this lets us get to
the actual build/test, and not barf while installing dependencies.

 .github/workflows/main.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 5f92a50271..13cc0fe807 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -285,7 +285,6 @@ jobs:
             pool: macos-13
           - jobname: osx-gcc
             cc: gcc-13
-            cc_package: gcc-13
             pool: macos-13
           - jobname: linux-gcc-default
             cc: gcc
-- 
2.45.0.414.g4009e73179

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] un-breaking osx-gcc ci job
  2024-05-09 16:22 [PATCH 0/3] un-breaking osx-gcc ci job Jeff King
                   ` (2 preceding siblings ...)
  2024-05-09 16:25 ` [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc Jeff King
@ 2024-05-09 16:52 ` Junio C Hamano
  3 siblings, 0 replies; 31+ messages in thread
From: Junio C Hamano @ 2024-05-09 16:52 UTC (permalink / raw
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> The osx-gcc job seems to break reliably for me, which started in the
> last week or so (since the last time I actually triggered CI). The third
> patch fixes it, but I noticed while investigating that the job is not
> even running gcc at all! That's fixed in the second patch. And the first
> one is just a cleanup I found along the way.
>
> I'm not sure that this job carries a huge amount of value over the
> osx-clang one, so we might consider just ditching it. But in the
> meantime, this should get things passing again.

Thanks.

>   [1/3]: ci: drop mention of BREW_INSTALL_PACKAGES variable
>   [2/3]: ci: avoid bare "gcc" for osx-gcc job
>   [3/3]: ci: stop installing "gcc-13" for osx-gcc
>
>  .github/workflows/main.yml | 3 +--
>  ci/install-dependencies.sh | 2 --
>  2 files changed, 1 insertion(+), 4 deletions(-)
>
> -Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-09 16:25 ` [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc Jeff King
@ 2024-05-10  7:00   ` Patrick Steinhardt
  2024-05-10 20:13     ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-10  7:00 UTC (permalink / raw
  To: Jeff King; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1196 bytes --]

On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
[snip]
> I'd like to report that this let me get a successful CI run, but I'm
> running into the thing where osx jobs seem to randomly hang sometimes
> and hit the 6-hour timeout. But I did confirm that this lets us get to
> the actual build/test, and not barf while installing dependencies.

Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
and until now I haven't yet figured out why that is.

>  .github/workflows/main.yml | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> index 5f92a50271..13cc0fe807 100644
> --- a/.github/workflows/main.yml
> +++ b/.github/workflows/main.yml
> @@ -285,7 +285,6 @@ jobs:
>              pool: macos-13
>            - jobname: osx-gcc
>              cc: gcc-13
> -            cc_package: gcc-13

As far as I can see this means that we don't install GCC at all anymore
via Homebrew. Does this mean that we now rely on the GCC version that is
preinstalled by Homebrew? Won't this break every time that Homebrew
changes the default GCC version?
 
I may be missing the obvious.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-09 16:24 ` [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job Jeff King
@ 2024-05-10  7:00   ` Patrick Steinhardt
  2024-05-10 20:16     ` Jeff King
  2024-05-10 20:32   ` Kyle Lippincott
  1 sibling, 1 reply; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-10  7:00 UTC (permalink / raw
  To: Jeff King; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1401 bytes --]

On Thu, May 09, 2024 at 12:24:15PM -0400, Jeff King wrote:
> On macOS, a bare "gcc" (without a version) will invoke a wrapper for
> clang, not actual gcc. Even when gcc is installed via homebrew, that
> only provides version-specific links in /usr/local/bin (like "gcc-13"),
> and never a version-agnostic "gcc" wrapper.

That's awful.

> As far as I can tell, this has been the case for a long time, and this
> osx-gcc job has largely been doing nothing. We can point it at "gcc-13",
> which will pick up the homebrew-installed version.
> 
> The fix here is specific to the github workflow file, as the gitlab one
> does not have a matching job.

Indeed we don't. We could add it, but that certainly does not need to be
part of this patch series.

> It's a little unfortunate that we cannot just ask for the latest version
> of gcc which homebrew provides, but as far as I can tell there is no
> easy alias (you'd have to find the highest number gcc-* in
> /usr/local/bin yourself).
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> We'll eventually have to bump "gcc-13" to "gcc-14" here, and so on. I
> don't think we ever cared about gcc-13 in particular; it's just that
> older versions of the runner image had some ancient version which we
> wanted to avoid.

As an alternative we could munge PATH such that Homebrew's GCC is found
before Apple's.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-10  7:00   ` Patrick Steinhardt
@ 2024-05-10 20:13     ` Jeff King
  2024-05-11  7:17       ` Patrick Steinhardt
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-05-10 20:13 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:

> On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> [snip]
> > I'd like to report that this let me get a successful CI run, but I'm
> > running into the thing where osx jobs seem to randomly hang sometimes
> > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > the actual build/test, and not barf while installing dependencies.
> 
> Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> and until now I haven't yet figured out why that is.

Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
would eventually fix it. ;)

It feels like a deadlock somewhere, though whether it is in our code, or
in our tests, or some system-ish issue with prove, perl, etc, I don't
know. It would be nice to catch it in the act and see what the process
tree looks like. I guess poking around in the test environment with
tmate might work, though I don't know if there's a way to get tmate
running simultaneously with the hung step (so you'd probably have to
connect, kick off the "make test" manually and hope it hangs).

> > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > index 5f92a50271..13cc0fe807 100644
> > --- a/.github/workflows/main.yml
> > +++ b/.github/workflows/main.yml
> > @@ -285,7 +285,6 @@ jobs:
> >              pool: macos-13
> >            - jobname: osx-gcc
> >              cc: gcc-13
> > -            cc_package: gcc-13
> 
> As far as I can see this means that we don't install GCC at all anymore
> via Homebrew. Does this mean that we now rely on the GCC version that is
> preinstalled by Homebrew? Won't this break every time that Homebrew
> changes the default GCC version?
>  
> I may be missing the obvious.

Yes, we'll have to update to a different version when the runner image
stops carrying gcc-13. But it's not based on homebrew's default.
According to:

  https://github.com/actions/runner-images/blob/macos-13/20240506.1/images/macos/macos-13-Readme.md

the runner image contains gcc-11, gcc-12, and gcc-13. So presumably it
would be a while before gcc-13 ages out and we have to bother bumping. I
do agree it would be nice to just use the latest gcc in the image, but I
don't think we can specify that here. I guess we could say "gcc-auto" or
something, and then the actual shell code could poke around for it.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-10  7:00   ` Patrick Steinhardt
@ 2024-05-10 20:16     ` Jeff King
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-05-10 20:16 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git

On Fri, May 10, 2024 at 09:00:14AM +0200, Patrick Steinhardt wrote:

> > We'll eventually have to bump "gcc-13" to "gcc-14" here, and so on. I
> > don't think we ever cared about gcc-13 in particular; it's just that
> > older versions of the runner image had some ancient version which we
> > wanted to avoid.
> 
> As an alternative we could munge PATH such that Homebrew's GCC is found
> before Apple's.

Ideally, yeah, but it's not just a PATH issue. AFAICT there literally is
no "gcc" in any PATH, only the version specific ones (even if you "brew
install gcc").

So you'd need something like:

  for i in /usr/local/bin/gcc-*
  do
	# rely on sorting of glob to do last-one-wins
	gcc=$i
  done

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-09 16:24 ` [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job Jeff King
  2024-05-10  7:00   ` Patrick Steinhardt
@ 2024-05-10 20:32   ` Kyle Lippincott
  2024-05-10 20:48     ` Junio C Hamano
  2024-05-10 22:02     ` Jeff King
  1 sibling, 2 replies; 31+ messages in thread
From: Kyle Lippincott @ 2024-05-10 20:32 UTC (permalink / raw
  To: Jeff King; +Cc: git

On Thu, May 9, 2024 at 9:24 AM Jeff King <peff@peff.net> wrote:
>
> On macOS, a bare "gcc" (without a version) will invoke a wrapper for
> clang, not actual gcc. Even when gcc is installed via homebrew, that
> only provides version-specific links in /usr/local/bin (like "gcc-13"),
> and never a version-agnostic "gcc" wrapper.
>
> As far as I can tell, this has been the case for a long time, and this
> osx-gcc job has largely been doing nothing.

If it's been doing nothing (which I interpreted as "it's doing the
same thing as osx-clang"), and we've not noticed any issues with that,
do we need the job at all? Should we just rely on clang and not test
with gcc on macOS, since it's not a compiler that's provided by the
platform anymore?

> We can point it at "gcc-13",
> which will pick up the homebrew-installed version.
>
> The fix here is specific to the github workflow file, as the gitlab one
> does not have a matching job.
>
> It's a little unfortunate that we cannot just ask for the latest version
> of gcc which homebrew provides, but as far as I can tell there is no
> easy alias (you'd have to find the highest number gcc-* in
> /usr/local/bin yourself).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> We'll eventually have to bump "gcc-13" to "gcc-14" here, and so on. I
> don't think we ever cared about gcc-13 in particular; it's just that
> older versions of the runner image had some ancient version which we
> wanted to avoid.
>
>  .github/workflows/main.yml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> index 5838986895..5f92a50271 100644
> --- a/.github/workflows/main.yml
> +++ b/.github/workflows/main.yml
> @@ -284,7 +284,7 @@ jobs:
>              cc: clang
>              pool: macos-13
>            - jobname: osx-gcc
> -            cc: gcc
> +            cc: gcc-13
>              cc_package: gcc-13
>              pool: macos-13
>            - jobname: linux-gcc-default
> --
> 2.45.0.414.g4009e73179
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-10 20:32   ` Kyle Lippincott
@ 2024-05-10 20:48     ` Junio C Hamano
  2024-05-10 22:02     ` Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Junio C Hamano @ 2024-05-10 20:48 UTC (permalink / raw
  To: Kyle Lippincott; +Cc: Jeff King, git

Kyle Lippincott <spectral@google.com> writes:

> On Thu, May 9, 2024 at 9:24 AM Jeff King <peff@peff.net> wrote:
>>
>> On macOS, a bare "gcc" (without a version) will invoke a wrapper for
>> clang, not actual gcc. Even when gcc is installed via homebrew, that
>> only provides version-specific links in /usr/local/bin (like "gcc-13"),
>> and never a version-agnostic "gcc" wrapper.
>>
>> As far as I can tell, this has been the case for a long time, and this
>> osx-gcc job has largely been doing nothing.
>
> If it's been doing nothing (which I interpreted as "it's doing the
> same thing as osx-clang"), and we've not noticed any issues with that,
> do we need the job at all? Should we just rely on clang and not test
> with gcc on macOS, since it's not a compiler that's provided by the
> platform anymore?

A very tempting suggestion.  I do not see any problems with the
direction.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-10 20:32   ` Kyle Lippincott
  2024-05-10 20:48     ` Junio C Hamano
@ 2024-05-10 22:02     ` Jeff King
  2024-05-10 22:47       ` Junio C Hamano
  1 sibling, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-05-10 22:02 UTC (permalink / raw
  To: Kyle Lippincott; +Cc: git

On Fri, May 10, 2024 at 01:32:15PM -0700, Kyle Lippincott wrote:

> On Thu, May 9, 2024 at 9:24 AM Jeff King <peff@peff.net> wrote:
> >
> > On macOS, a bare "gcc" (without a version) will invoke a wrapper for
> > clang, not actual gcc. Even when gcc is installed via homebrew, that
> > only provides version-specific links in /usr/local/bin (like "gcc-13"),
> > and never a version-agnostic "gcc" wrapper.
> >
> > As far as I can tell, this has been the case for a long time, and this
> > osx-gcc job has largely been doing nothing.
> 
> If it's been doing nothing (which I interpreted as "it's doing the
> same thing as osx-clang"), and we've not noticed any issues with that,
> do we need the job at all? Should we just rely on clang and not test
> with gcc on macOS, since it's not a compiler that's provided by the
> platform anymore?

Your interpretation is correct; it was just doing the same thing as
osx-clang. As I said earlier, I'd be fine with just dropping it. It's
possible that there is value in testing there with gcc (that we've been
missing out on), but to know that we'd have to understand the original
reasons it was added.

Looks like it came from 889cacb689 (ci: configure GitHub Actions for
CI/PR, 2020-04-11), but that was just porting from the Azure pipelines
config. That system got its osx_gcc job from 27be78173d (Add a build
definition for Azure DevOps, 2019-01-29), but that in turn was just
porting the Travis config. The Travis job came from 522354d70f (Add
Travis CI support, 2015-11-27), and there's no specific rationale given.
But it used a build matrix of (os, compiler), so we got all four of
{linux,osx}-{gcc,clang}. So it mostly seems like "more is better" and
that was the easiest way to define it.

I do think there's value in testing with both clang and gcc in
general[1]. And there is _some_ code which is compiled only on macos
and not elsewhere. So this would be our only chance for gcc to see it.
But it seems like a pretty small return for an entire parallel job.
Especially as I do not think it has uncovered anything interesting in
the past (even when it was working).

-Peff

[1] Another quirk is that we run the whole test suite for both
    compilers, which is probably overkill. The main value in comparing
    gcc vs clang is that we don't use any constructs that the compiler
    complains about. It's _possible_ for there to be a construct that
    the compiler does not notice but which causes a runtime difference
    (say, undefined behavior which happens to work out on one compiler),
    but I think we're again hitting diminishing returns.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-10 22:02     ` Jeff King
@ 2024-05-10 22:47       ` Junio C Hamano
  2024-05-11 17:21         ` Patrick Steinhardt
  0 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2024-05-10 22:47 UTC (permalink / raw
  To: Jeff King; +Cc: Kyle Lippincott, git

Jeff King <peff@peff.net> writes:

> I do think there's value in testing with both clang and gcc in
> general[1]. And there is _some_ code which is compiled only on macos
> and not elsewhere. So this would be our only chance for gcc to see it.
> But it seems like a pretty small return for an entire parallel job.
> Especially as I do not think it has uncovered anything interesting in
> the past (even when it was working).

100% agreed.

> [1] Another quirk is that we run the whole test suite for both
>     compilers, which is probably overkill. The main value in comparing
>     gcc vs clang is that we don't use any constructs that the compiler
>     complains about. It's _possible_ for there to be a construct that
>     the compiler does not notice but which causes a runtime difference
>     (say, undefined behavior which happens to work out on one compiler),
>     but I think we're again hitting diminishing returns.

Yeah, that is a very good point.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-10 20:13     ` Jeff King
@ 2024-05-11  7:17       ` Patrick Steinhardt
  2024-05-16 12:36         ` Patrick Steinhardt
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-11  7:17 UTC (permalink / raw
  To: Jeff King; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 3224 bytes --]

On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
> 
> > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > [snip]
> > > I'd like to report that this let me get a successful CI run, but I'm
> > > running into the thing where osx jobs seem to randomly hang sometimes
> > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > the actual build/test, and not barf while installing dependencies.
> > 
> > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > and until now I haven't yet figured out why that is.
> 
> Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> would eventually fix it. ;)
> 
> It feels like a deadlock somewhere, though whether it is in our code, or
> in our tests, or some system-ish issue with prove, perl, etc, I don't
> know. It would be nice to catch it in the act and see what the process
> tree looks like. I guess poking around in the test environment with
> tmate might work, though I don't know if there's a way to get tmate
> running simultaneously with the hung step (so you'd probably have to
> connect, kick off the "make test" manually and hope it hangs).

My hunch tells me that it's the Perforce tests -- after all, this is
where the jobs get stuck, too. In "lib-git-p4.sh" we already document
that p4d is known to crash at times, and overall the logic to spawn the
server is quite convoluted.

I did try to get more useful logs yesterday. But as usual, once you
_want_ to reproduce a failure like this is doesn't happen anymore.

> > > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > > index 5f92a50271..13cc0fe807 100644
> > > --- a/.github/workflows/main.yml
> > > +++ b/.github/workflows/main.yml
> > > @@ -285,7 +285,6 @@ jobs:
> > >              pool: macos-13
> > >            - jobname: osx-gcc
> > >              cc: gcc-13
> > > -            cc_package: gcc-13
> > 
> > As far as I can see this means that we don't install GCC at all anymore
> > via Homebrew. Does this mean that we now rely on the GCC version that is
> > preinstalled by Homebrew? Won't this break every time that Homebrew
> > changes the default GCC version?
> >  
> > I may be missing the obvious.
> 
> Yes, we'll have to update to a different version when the runner image
> stops carrying gcc-13. But it's not based on homebrew's default.
> According to:
> 
>   https://github.com/actions/runner-images/blob/macos-13/20240506.1/images/macos/macos-13-Readme.md
> 
> the runner image contains gcc-11, gcc-12, and gcc-13. So presumably it
> would be a while before gcc-13 ages out and we have to bother bumping. I
> do agree it would be nice to just use the latest gcc in the image, but I
> don't think we can specify that here. I guess we could say "gcc-auto" or
> something, and then the actual shell code could poke around for it.

Ah, thanks for the explanation. I was worried that things might break at
arbitrary points in time. But if this is only containde to whenever we
upgrade the runner image, then I don't see this as much of a problem.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-10 22:47       ` Junio C Hamano
@ 2024-05-11 17:21         ` Patrick Steinhardt
  2024-05-16  7:19           ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-11 17:21 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Jeff King, Kyle Lippincott, git

[-- Attachment #1: Type: text/plain, Size: 848 bytes --]

On Fri, May 10, 2024 at 03:47:39PM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
[snip]
> > [1] Another quirk is that we run the whole test suite for both
> >     compilers, which is probably overkill. The main value in comparing
> >     gcc vs clang is that we don't use any constructs that the compiler
> >     complains about. It's _possible_ for there to be a construct that
> >     the compiler does not notice but which causes a runtime difference
> >     (say, undefined behavior which happens to work out on one compiler),
> >     but I think we're again hitting diminishing returns.
> 
> Yeah, that is a very good point.

On Linux, we have the "pedantic" job that runs Fedora and only compiles
the sources with DEVOPTS=pedantic without running any of the tests. We
could do the same on macOS.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-11 17:21         ` Patrick Steinhardt
@ 2024-05-16  7:19           ` Jeff King
  2024-05-16  7:27             ` Jeff King
  2024-05-16  9:54             ` Patrick Steinhardt
  0 siblings, 2 replies; 31+ messages in thread
From: Jeff King @ 2024-05-16  7:19 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: Junio C Hamano, Kyle Lippincott, git

On Sat, May 11, 2024 at 07:21:28PM +0200, Patrick Steinhardt wrote:

> On Fri, May 10, 2024 at 03:47:39PM -0700, Junio C Hamano wrote:
> > Jeff King <peff@peff.net> writes:
> [snip]
> > > [1] Another quirk is that we run the whole test suite for both
> > >     compilers, which is probably overkill. The main value in comparing
> > >     gcc vs clang is that we don't use any constructs that the compiler
> > >     complains about. It's _possible_ for there to be a construct that
> > >     the compiler does not notice but which causes a runtime difference
> > >     (say, undefined behavior which happens to work out on one compiler),
> > >     but I think we're again hitting diminishing returns.
> > 
> > Yeah, that is a very good point.
> 
> On Linux, we have the "pedantic" job that runs Fedora and only compiles
> the sources with DEVOPTS=pedantic without running any of the tests. We
> could do the same on macOS.

Yeah, I think the infrastructure is there (looks like just resetting
$run_tests). We probably could stand to use it in more places. E.g., is
there even value in running the tests for linux-gcc and linux-clang?
It's _possible_ for there to be a run-time difference in the compiler
outputs, but we may be hitting diminishing returns. The main value I
think is just seeing what the compilers complain about.

But I dunno. This thread argues there is value in running the tests with
the separate compiler:

  https://lore.kernel.org/git/pull.266.git.gitgitgadget@gmail.com/

which I guess would argue for doing the same for osx-clang and osx-gcc
(if the latter continues to exist).

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-16  7:19           ` Jeff King
@ 2024-05-16  7:27             ` Jeff King
  2024-05-16  9:54             ` Patrick Steinhardt
  1 sibling, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-05-16  7:27 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: Junio C Hamano, Kyle Lippincott, git

On Thu, May 16, 2024 at 03:19:30AM -0400, Jeff King wrote:

> On Sat, May 11, 2024 at 07:21:28PM +0200, Patrick Steinhardt wrote:
> 
> > On Fri, May 10, 2024 at 03:47:39PM -0700, Junio C Hamano wrote:
> > > Jeff King <peff@peff.net> writes:
> > [snip]
> > > > [1] Another quirk is that we run the whole test suite for both
> > > >     compilers, which is probably overkill. The main value in comparing
> > > >     gcc vs clang is that we don't use any constructs that the compiler
> > > >     complains about. It's _possible_ for there to be a construct that
> > > >     the compiler does not notice but which causes a runtime difference
> > > >     (say, undefined behavior which happens to work out on one compiler),
> > > >     but I think we're again hitting diminishing returns.
> > > 
> > > Yeah, that is a very good point.
> > 
> > On Linux, we have the "pedantic" job that runs Fedora and only compiles
> > the sources with DEVOPTS=pedantic without running any of the tests. We
> > could do the same on macOS.
> 
> Yeah, I think the infrastructure is there (looks like just resetting
> $run_tests). We probably could stand to use it in more places. E.g., is
> there even value in running the tests for linux-gcc and linux-clang?
> It's _possible_ for there to be a run-time difference in the compiler
> outputs, but we may be hitting diminishing returns. The main value I
> think is just seeing what the compilers complain about.

Actually, scratch that. I forgot we dropped linux-clang last summer in
d88d727143 (ci: drop linux-clang job, 2023-06-01), since it was mostly
redundant with the sanitizer builds (which use clang).

There is still "linux-gcc" and "linux-gcc-default", the former of which
uses an older version of the compiler. And again, it's not clear to me
that running the actual tests is uncovering useful stuff there.
Certainly it is possible, but I wonder if that is the best bang for the
buck (e.g., if we wanted to spend a "make test" worth of CPU, then
something like osx-sha256 would IMHO be more likely to uncover something
useful).

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-16  7:19           ` Jeff King
  2024-05-16  7:27             ` Jeff King
@ 2024-05-16  9:54             ` Patrick Steinhardt
  2024-05-17  8:19               ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-16  9:54 UTC (permalink / raw
  To: Jeff King; +Cc: Junio C Hamano, Kyle Lippincott, git

[-- Attachment #1: Type: text/plain, Size: 2898 bytes --]

On Thu, May 16, 2024 at 03:19:30AM -0400, Jeff King wrote:
> On Sat, May 11, 2024 at 07:21:28PM +0200, Patrick Steinhardt wrote:
> 
> > On Fri, May 10, 2024 at 03:47:39PM -0700, Junio C Hamano wrote:
> > > Jeff King <peff@peff.net> writes:
> > [snip]
> > > > [1] Another quirk is that we run the whole test suite for both
> > > >     compilers, which is probably overkill. The main value in comparing
> > > >     gcc vs clang is that we don't use any constructs that the compiler
> > > >     complains about. It's _possible_ for there to be a construct that
> > > >     the compiler does not notice but which causes a runtime difference
> > > >     (say, undefined behavior which happens to work out on one compiler),
> > > >     but I think we're again hitting diminishing returns.
> > > 
> > > Yeah, that is a very good point.
> > 
> > On Linux, we have the "pedantic" job that runs Fedora and only compiles
> > the sources with DEVOPTS=pedantic without running any of the tests. We
> > could do the same on macOS.
> 
> Yeah, I think the infrastructure is there (looks like just resetting
> $run_tests). We probably could stand to use it in more places. E.g., is
> there even value in running the tests for linux-gcc and linux-clang?
> It's _possible_ for there to be a run-time difference in the compiler
> outputs, but we may be hitting diminishing returns. The main value I
> think is just seeing what the compilers complain about.

That's certainly the biggest part, yeah. But I have been hitting lots of
compiler-dependent behaviour. This is mostly in the area of bugs though,
where for example toolchain A may initialize variables on the stack to
all zeroes while toolchain B does not.

I guess this is mostly a question of defaults though, and I think it is
partially influenced by the overall toolchain environment as configured
by my distro. Especially hardening options are for example quite likely
to lead to different behaviour.

I'm not sure whether this is sufficient reason on its own to warrant
testing with several toolchains. But we can easily combine this with
additional tuning knobs. Two separate test jobs with GCC and Clang are
comparatively boring. But if we make it GCC+SHA1 and Clang+SHA256 then
it becomes more interesting.

So I think dropping the compiler coverage completely is rather pointless
because we already run multiple different jobs per platform anyway. But
we should investigate whether we can cleverly combine those so that we
do not need a separate jobs just to test a specific compiler.

Patrick

> But I dunno. This thread argues there is value in running the tests with
> the separate compiler:
> 
>   https://lore.kernel.org/git/pull.266.git.gitgitgadget@gmail.com/
> 
> which I guess would argue for doing the same for osx-clang and osx-gcc
> (if the latter continues to exist).
> 
> -Peff

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-11  7:17       ` Patrick Steinhardt
@ 2024-05-16 12:36         ` Patrick Steinhardt
  2024-05-17  8:11           ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-16 12:36 UTC (permalink / raw
  To: Jeff King; +Cc: git, Derrick Stolee

[-- Attachment #1: Type: text/plain, Size: 3989 bytes --]

On Sat, May 11, 2024 at 09:17:41AM +0200, Patrick Steinhardt wrote:
> On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> > On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
> > 
> > > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > > [snip]
> > > > I'd like to report that this let me get a successful CI run, but I'm
> > > > running into the thing where osx jobs seem to randomly hang sometimes
> > > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > > the actual build/test, and not barf while installing dependencies.
> > > 
> > > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > > and until now I haven't yet figured out why that is.
> > 
> > Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> > would eventually fix it. ;)
> > 
> > It feels like a deadlock somewhere, though whether it is in our code, or
> > in our tests, or some system-ish issue with prove, perl, etc, I don't
> > know. It would be nice to catch it in the act and see what the process
> > tree looks like. I guess poking around in the test environment with
> > tmate might work, though I don't know if there's a way to get tmate
> > running simultaneously with the hung step (so you'd probably have to
> > connect, kick off the "make test" manually and hope it hangs).
> 
> My hunch tells me that it's the Perforce tests -- after all, this is
> where the jobs get stuck, too. In "lib-git-p4.sh" we already document
> that p4d is known to crash at times, and overall the logic to spawn the
> server is quite convoluted.
> 
> I did try to get more useful logs yesterday. But as usual, once you
> _want_ to reproduce a failure like this is doesn't happen anymore.

I was spending (or rather wasting?) some more time on this. With the
below diff I was able to get a list of processes running after ~50
minutes:

    diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
    index 98dda42045..d5570b59d3 100755
    --- a/ci/run-build-and-tests.sh
    +++ b/ci/run-build-and-tests.sh
    @@ -51,8 +51,15 @@ esac
     group Build make
     if test -n "$run_tests"
     then
    -	group "Run tests" make test ||
    +	(
    +		sleep 3200 &&
    +		mkdir -p t/failed-test-artifacts &&
    +		ps -A >t/failed-test-artifacts/ps 2>&1
    +	) &
    +	pid=$!
    +	group "Run tests" gtimeout 1h make test ||
        handle_failed_tests
    +	kill "$pid"
     fi
     check_unignored_build_artifacts

I trimmed that process list to the following set of relevant processes:

  PID TTY           TIME CMD
 5196 ??         0:00.01 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5242 ??         0:00.00 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5244 ??         0:00.00 tee -a /Volumes/RAMDisk/test-results/t9211-scalar-clone.out
 5245 ??         0:00.09 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 7235 ??         0:00.02 /Users/gitlab/builds/gitlab-org/git/scalar clone file:///Volumes/RAMDisk/trash directory.t9211-scalar-clone/to-clone maint-fail
 7265 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fetch --quiet --no-progress origin
 7276 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fsmonitor--daemon run --detach --ipc-threads=8

So it seems like the issue is t9211, and the hang happens in "scalar
clone warns when background maintenance fails" specifically. What
exactly the root cause is I have no clue though. Maybe an fsmonitor
race, maybe something else entirely. Hard to say as I have never seen
this happen on any other platform than macOS, and I do not have access
to a Mac myself.

The issue also doesn't seem to occur when running t9211 on its own, but
only when running the full test suite. This may further indicate that
there is a race condition, where the additional load improves the
likelihood of it. Or there is bad interaction with another test.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-16 12:36         ` Patrick Steinhardt
@ 2024-05-17  8:11           ` Jeff King
  2024-05-17  8:25             ` Patrick Steinhardt
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-05-17  8:11 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: git, Derrick Stolee

On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:

> I was spending (or rather wasting?) some more time on this. With the
> below diff I was able to get a list of processes running after ~50
> minutes:

I was going to say "good, now I don't have to waste time on it". But
your findings only nerd-sniped me into digging more. ;)

> So it seems like the issue is t9211, and the hang happens in "scalar
> clone warns when background maintenance fails" specifically. What
> exactly the root cause is I have no clue though. Maybe an fsmonitor
> race, maybe something else entirely. Hard to say as I have never seen
> this happen on any other platform than macOS, and I do not have access
> to a Mac myself.
> 
> The issue also doesn't seem to occur when running t9211 on its own, but
> only when running the full test suite. This may further indicate that
> there is a race condition, where the additional load improves the
> likelihood of it. Or there is bad interaction with another test.

I can reproduce it at will and relatively quickly using "--stress" with
t9211. I pushed up a hacky commit that removes all CI jobs except for
os-clang, and it stops short of running the build/tests and opens a
shell using tmate. For reference (though you'd need to work out
something similar for GitLab).

  https://github.com/peff/git/commit/f825fa36ed95bed414b0d6d9e8b21799e2e167e4

And then just:

  make -j8
  cd t
  ./t9211-scalar-clone.sh --stress

Give it a minute or two, and you'll see most of the jobs have hung, with
one or two "winners" continuing (once most of them are hanging, the load
is low enough that the race doesn't happen). So you'll see 3.17, 3.18,
3.19, and so on, indicating that job 3 is still going and completing its
19th run. But everything else is stuck and stops producing output.

You can likewise see processes in "ps" that are a few minutes old, which
is another way to find the stuck ones. And I get the same three
processes as you: scalar clone, fetch, and fsmonitor--daemon.

And here's where I ran into tooling issues.

Normally I'd "strace -p" to see what the hung processes are doing. We
don't have that on macOS. Doing "sudo dtruss -p" runs without complaint,
but it looks like it doesn't report on the current syscall (where we're
presumably blocking).

I installed gdb, which does seem to work, but attaching to the running
processes doesn't show a useful backtrace (even after making sure to
build with "-g -O0", and confirming that regular "gdb ./git" works OK).

One can guess that scalar is in waitpid() waiting for git-fetch. But
what's fetch waiting on? The other side of upload-pack is dead.
According to lsof, it does have a unix socket open to fsmonitor. So
maybe it's trying to read there?

Curiously killing fsmonitor doesn't un-stick fetch, and nor does killing
fetch unstick scalar. So either my guesses above are wrong, or there's
something else weird causing them to hang.

I imagine there may be better tools to poke at things, but I'm at the
limits of my macOS knowledge. But maybe the recipe above is enough for
somebody more clueful to recreate and investigate the situation (it
probably would also be easy to just run the --stress script locally if
somebody actually has a mac).

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-16  9:54             ` Patrick Steinhardt
@ 2024-05-17  8:19               ` Jeff King
  2024-05-17  8:33                 ` Patrick Steinhardt
  2024-05-17 16:59                 ` Junio C Hamano
  0 siblings, 2 replies; 31+ messages in thread
From: Jeff King @ 2024-05-17  8:19 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: Junio C Hamano, Kyle Lippincott, git

On Thu, May 16, 2024 at 11:54:44AM +0200, Patrick Steinhardt wrote:

> That's certainly the biggest part, yeah. But I have been hitting lots of
> compiler-dependent behaviour. This is mostly in the area of bugs though,
> where for example toolchain A may initialize variables on the stack to
> all zeroes while toolchain B does not.

I've definitely run into differing runtime outcomes for undefined
behavior stuff like that. But in my experience most of that is
consistently found by ASan/UBSan (which we do run in CI these days).

It's possible there are cases that those sanitizers don't catch but that
cause differing behavior. But I can't think of one off the top of my
head where that has happened.

> I'm not sure whether this is sufficient reason on its own to warrant
> testing with several toolchains. But we can easily combine this with
> additional tuning knobs. Two separate test jobs with GCC and Clang are
> comparatively boring. But if we make it GCC+SHA1 and Clang+SHA256 then
> it becomes more interesting.

Yeah. Combining orthogonal properties into a single job lets us cover
both (for the common case of success on both) with less CPU. But:

  - it can sometimes be hard to figure out which of the properties was
    responsible for a failure. That was the very subject of the thread I
    referenced earlier, where "linux-gcc" was "use gcc" and also "set
    lots of knobs".

  - they might not actually be orthogonal. If you care about checking
    runtime behavior in the output of two compilers, then that _could_
    manifest only in the sha256 code. Or as you get into more
    properties, they may overlap in other ways. I think reftable+sha256
    is an interesting (eventual) combo to test on top of reftable+sha1.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-17  8:11           ` Jeff King
@ 2024-05-17  8:25             ` Patrick Steinhardt
  2024-05-17 11:30               ` Patrick Steinhardt
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-17  8:25 UTC (permalink / raw
  To: Jeff King; +Cc: git, Derrick Stolee

[-- Attachment #1: Type: text/plain, Size: 1207 bytes --]

On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
[snip]
> One can guess that scalar is in waitpid() waiting for git-fetch. But
> what's fetch waiting on? The other side of upload-pack is dead.
> According to lsof, it does have a unix socket open to fsmonitor. So
> maybe it's trying to read there?

That was also my guess. I tried whether disabling fsmonitor via
`core.fsmonitor=false` helps, but that did not seem to be the case.
Either because it didn't have the desired effect, or because the root
cause is not fsmonitor. No idea which of both it is.

> Curiously killing fsmonitor doesn't un-stick fetch, and nor does killing
> fetch unstick scalar. So either my guesses above are wrong, or there's
> something else weird causing them to hang.
> 
> I imagine there may be better tools to poke at things, but I'm at the
> limits of my macOS knowledge. But maybe the recipe above is enough for
> somebody more clueful to recreate and investigate the situation (it
> probably would also be easy to just run the --stress script locally if
> somebody actually has a mac).

Let's hope it is :)

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-17  8:19               ` Jeff King
@ 2024-05-17  8:33                 ` Patrick Steinhardt
  2024-05-17 16:59                 ` Junio C Hamano
  1 sibling, 0 replies; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-17  8:33 UTC (permalink / raw
  To: Jeff King; +Cc: Junio C Hamano, Kyle Lippincott, git

[-- Attachment #1: Type: text/plain, Size: 2886 bytes --]

On Fri, May 17, 2024 at 04:19:09AM -0400, Jeff King wrote:
> On Thu, May 16, 2024 at 11:54:44AM +0200, Patrick Steinhardt wrote:
> 
> > That's certainly the biggest part, yeah. But I have been hitting lots of
> > compiler-dependent behaviour. This is mostly in the area of bugs though,
> > where for example toolchain A may initialize variables on the stack to
> > all zeroes while toolchain B does not.
> 
> I've definitely run into differing runtime outcomes for undefined
> behavior stuff like that. But in my experience most of that is
> consistently found by ASan/UBSan (which we do run in CI these days).
> 
> It's possible there are cases that those sanitizers don't catch but that
> cause differing behavior. But I can't think of one off the top of my
> head where that has happened.

True, these should be sufficient indeed.

> > I'm not sure whether this is sufficient reason on its own to warrant
> > testing with several toolchains. But we can easily combine this with
> > additional tuning knobs. Two separate test jobs with GCC and Clang are
> > comparatively boring. But if we make it GCC+SHA1 and Clang+SHA256 then
> > it becomes more interesting.
> 
> Yeah. Combining orthogonal properties into a single job lets us cover
> both (for the common case of success on both) with less CPU. But:
> 
>   - it can sometimes be hard to figure out which of the properties was
>     responsible for a failure. That was the very subject of the thread I
>     referenced earlier, where "linux-gcc" was "use gcc" and also "set
>     lots of knobs".

That's true. But for me the problem typically is that you need to be
aware that the job uses different properties in the first place -- this
is quite hidden away. Figuring out that a job uses "main" as default
branch just because it is called "linux-gcc" is quite hard unless you
are aware of how exactly our CI systems work. And besides being hard to
discover, it's also really fragile.

I wish that we got rid of relying on job names and made this more
discoverable. The obvious way to do so would be to instead declare the
`GIT_TEST_` variables in the CI definitions, which would make them easy
to spot and change. But it of course has the big downside that it is now
quite easy for the different CI platforms to diverge.

>   - they might not actually be orthogonal. If you care about checking
>     runtime behavior in the output of two compilers, then that _could_
>     manifest only in the sha256 code. Or as you get into more
>     properties, they may overlap in other ways. I think reftable+sha256
>     is an interesting (eventual) combo to test on top of reftable+sha1.

Yes, I really want to have reftable+sha256, as well. I didn't feel like
adding it back then because we already have a ton of jobs, and adding
another job felt like pushing the limits.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-17  8:25             ` Patrick Steinhardt
@ 2024-05-17 11:30               ` Patrick Steinhardt
  2024-05-26  6:34                 ` Philip
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-17 11:30 UTC (permalink / raw
  To: Jeff King; +Cc: git, Derrick Stolee, Jeff Hostetler

[-- Attachment #1: Type: text/plain, Size: 2405 bytes --]

On Fri, May 17, 2024 at 10:25:20AM +0200, Patrick Steinhardt wrote:
> On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> > On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
> [snip]
> > One can guess that scalar is in waitpid() waiting for git-fetch. But
> > what's fetch waiting on? The other side of upload-pack is dead.
> > According to lsof, it does have a unix socket open to fsmonitor. So
> > maybe it's trying to read there?
> 
> That was also my guess. I tried whether disabling fsmonitor via
> `core.fsmonitor=false` helps, but that did not seem to be the case.
> Either because it didn't have the desired effect, or because the root
> cause is not fsmonitor. No idea which of both it is.

The root cause actually is the fsmonitor. I was using your tmate hack to
SSH into one of the failed jobs, and there had been 7 instances of the
fsmonitor lurking. After killing all of them the job got unstuck and ran
to completion.

The reason why setting `core.fsmonitor=false` is ineffective is because
in "scalar.c" we always configure `core.fsmonitor=true` in the repo
config and thus override the setting. I was checking whether it would
make sense to defer enabling the fsmonitor until after the fetch and
checkout have concluded. But funny enough, the below patch caused the
pipeline to now hang deterministically.

Puzzled.

Patrick

diff --git a/scalar.c b/scalar.c
index 7234049a1b..67f85c7adc 100644
--- a/scalar.c
+++ b/scalar.c
@@ -178,13 +178,6 @@ static int set_recommended_config(int reconfigure)
                     config[i].key, config[i].value);
    }
 
-	if (have_fsmonitor_support()) {
-		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
-		if (set_scalar_config(&fsmonitor, reconfigure))
-			return error(_("could not configure %s=%s"),
-				     fsmonitor.key, fsmonitor.value);
-	}
-
    /*
     * The `log.excludeDecoration` setting is special because it allows
     * for multiple values.
@@ -539,6 +532,13 @@ static int cmd_clone(int argc, const char **argv)
    if (res)
        goto cleanup;
 
+	if (have_fsmonitor_support()) {
+		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
+		if (set_scalar_config(&fsmonitor, 0))
+			return error(_("could not configure %s=%s"),
+				     fsmonitor.key, fsmonitor.value);
+	}
+
    res = register_dir();
 
 cleanup:

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-17  8:19               ` Jeff King
  2024-05-17  8:33                 ` Patrick Steinhardt
@ 2024-05-17 16:59                 ` Junio C Hamano
  2024-05-23  9:10                   ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2024-05-17 16:59 UTC (permalink / raw
  To: Jeff King; +Cc: Patrick Steinhardt, Kyle Lippincott, git

Jeff King <peff@peff.net> writes:

> Yeah. Combining orthogonal properties into a single job lets us cover
> both (for the common case of success on both) with less CPU. But:
>
>   - it can sometimes be hard to figure out which of the properties was
>     responsible for a failure. That was the very subject of the thread I
>     referenced earlier, where "linux-gcc" was "use gcc" and also "set
>     lots of knobs".
>
>   - they might not actually be orthogonal. If you care about checking
>     runtime behavior in the output of two compilers, then that _could_
>     manifest only in the sha256 code. Or as you get into more
>     properties, they may overlap in other ways. I think reftable+sha256
>     is an interesting (eventual) combo to test on top of reftable+sha1.

We could consider permuting, then?  If we (for the sake of
simplicity) had two jobs available, one compiled with GCC and the
other compiled with clang, we can enumerate other properties
(e.g. <SHA-1 vs SHA-256>, <reftable vs reffiles>) into pairs, and in
one run, GCC may be running SHA-1+reffiles while clang is running
SHA-256+reftable, and in another run, GCC may be running
SHA-256+reffiles, etc. --- eventually we cover all four combinations
(admittedly for different commits).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-17 16:59                 ` Junio C Hamano
@ 2024-05-23  9:10                   ` Jeff King
  2024-05-23 15:35                     ` Junio C Hamano
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2024-05-23  9:10 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Patrick Steinhardt, Kyle Lippincott, git

On Fri, May 17, 2024 at 09:59:35AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Yeah. Combining orthogonal properties into a single job lets us cover
> > both (for the common case of success on both) with less CPU. But:
> >
> >   - it can sometimes be hard to figure out which of the properties was
> >     responsible for a failure. That was the very subject of the thread I
> >     referenced earlier, where "linux-gcc" was "use gcc" and also "set
> >     lots of knobs".
> >
> >   - they might not actually be orthogonal. If you care about checking
> >     runtime behavior in the output of two compilers, then that _could_
> >     manifest only in the sha256 code. Or as you get into more
> >     properties, they may overlap in other ways. I think reftable+sha256
> >     is an interesting (eventual) combo to test on top of reftable+sha1.
> 
> We could consider permuting, then?  If we (for the sake of
> simplicity) had two jobs available, one compiled with GCC and the
> other compiled with clang, we can enumerate other properties
> (e.g. <SHA-1 vs SHA-256>, <reftable vs reffiles>) into pairs, and in
> one run, GCC may be running SHA-1+reffiles while clang is running
> SHA-256+reftable, and in another run, GCC may be running
> SHA-256+reffiles, etc. --- eventually we cover all four combinations
> (admittedly for different commits).

That's a neat idea to get eventual coverage. I have a feeling it would
be a pain in practice, though, because now the CI results aren't quite
deterministic. So if commit X introduces a bug in some combination, we
might not find out until later, and seeing that X passed all tests
doesn't absolve it of responsibility anymore.

Likewise, I often have to re-run CI to get more data, or to see if a
failure is a flake. If it changed what it ran that would be confusing
(though I guess you could use the commit hash as the random "seed" for
deciding which permutation to run).

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job
  2024-05-23  9:10                   ` Jeff King
@ 2024-05-23 15:35                     ` Junio C Hamano
  0 siblings, 0 replies; 31+ messages in thread
From: Junio C Hamano @ 2024-05-23 15:35 UTC (permalink / raw
  To: Jeff King; +Cc: Patrick Steinhardt, Kyle Lippincott, git

Jeff King <peff@peff.net> writes:

> ... be a pain in practice, though, because now the CI results aren't quite
> deterministic. So if commit X introduces a bug in some combination, we
> might not find out until later, and seeing that X passed all tests
> doesn't absolve it of responsibility anymore.

Right.  A poor idea retracted.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-17 11:30               ` Patrick Steinhardt
@ 2024-05-26  6:34                 ` Philip
  2024-05-26 19:23                   ` Junio C Hamano
  2024-05-29  9:27                   ` Jeff King
  0 siblings, 2 replies; 31+ messages in thread
From: Philip @ 2024-05-26  6:34 UTC (permalink / raw
  To: Patrick Steinhardt; +Cc: Jeff King, git, Derrick Stolee, Jeff Hostetler

Part of the problem seems to be that the Github actions runner has a bug
on OSX: https://github.com/actions/runner/issues/884

Based on investigating this for a while by setting up a self-hosted actions
runner, it seems to have to do with a broken pipe triggering incomplete
output capture / termination detection by either Github Action Runner (
see issue thread) or maybe even Dotnet Core's
System.Diagnostics.Process functionality.

As for the actual failing test, t9211, what I got on my machine was a
failure during clone: `unknown repository extension found: refstorage`.
In the trash directory, the .git/config did specify that extension.
Perhaps some interference coming from
t9500-gitweb-standalone-no-errors.sh, since it invokes:

> git config extensions.refstorage "$refstorage"

Cheers,
Phil

On Fri, May 17, 2024 at 7:30 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, May 17, 2024 at 10:25:20AM +0200, Patrick Steinhardt wrote:
> > On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> > > On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
> > [snip]
> > > One can guess that scalar is in waitpid() waiting for git-fetch. But
> > > what's fetch waiting on? The other side of upload-pack is dead.
> > > According to lsof, it does have a unix socket open to fsmonitor. So
> > > maybe it's trying to read there?
> >
> > That was also my guess. I tried whether disabling fsmonitor via
> > `core.fsmonitor=false` helps, but that did not seem to be the case.
> > Either because it didn't have the desired effect, or because the root
> > cause is not fsmonitor. No idea which of both it is.
>
> The root cause actually is the fsmonitor. I was using your tmate hack to
> SSH into one of the failed jobs, and there had been 7 instances of the
> fsmonitor lurking. After killing all of them the job got unstuck and ran
> to completion.
>
> The reason why setting `core.fsmonitor=false` is ineffective is because
> in "scalar.c" we always configure `core.fsmonitor=true` in the repo
> config and thus override the setting. I was checking whether it would
> make sense to defer enabling the fsmonitor until after the fetch and
> checkout have concluded. But funny enough, the below patch caused the
> pipeline to now hang deterministically.
>
> Puzzled.
>
> Patrick
>
> diff --git a/scalar.c b/scalar.c
> index 7234049a1b..67f85c7adc 100644
> --- a/scalar.c
> +++ b/scalar.c
> @@ -178,13 +178,6 @@ static int set_recommended_config(int reconfigure)
>                      config[i].key, config[i].value);
>     }
>
> -       if (have_fsmonitor_support()) {
> -               struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
> -               if (set_scalar_config(&fsmonitor, reconfigure))
> -                       return error(_("could not configure %s=%s"),
> -                                    fsmonitor.key, fsmonitor.value);
> -       }
> -
>     /*
>      * The `log.excludeDecoration` setting is special because it allows
>      * for multiple values.
> @@ -539,6 +532,13 @@ static int cmd_clone(int argc, const char **argv)
>     if (res)
>         goto cleanup;
>
> +       if (have_fsmonitor_support()) {
> +               struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
> +               if (set_scalar_config(&fsmonitor, 0))
> +                       return error(_("could not configure %s=%s"),
> +                                    fsmonitor.key, fsmonitor.value);
> +       }
> +
>     res = register_dir();
>
>  cleanup:

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-26  6:34                 ` Philip
@ 2024-05-26 19:23                   ` Junio C Hamano
  2024-05-27  5:12                     ` Patrick Steinhardt
  2024-05-29  9:27                   ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2024-05-26 19:23 UTC (permalink / raw
  To: Philip; +Cc: Patrick Steinhardt, Jeff King, git, Derrick Stolee,
	Jeff Hostetler

Philip <philip.c.peterson@gmail.com> writes:

> Part of the problem seems to be that the Github actions runner has a bug
> on OSX: https://github.com/actions/runner/issues/884
>
> Based on investigating this for a while by setting up a self-hosted actions
> runner, it seems to have to do with a broken pipe triggering incomplete
> output capture / termination detection by either Github Action Runner (
> see issue thread) or maybe even Dotnet Core's
> System.Diagnostics.Process functionality.

Thanks for digging into this.

> As for the actual failing test, t9211, what I got on my machine was a
> failure during clone: `unknown repository extension found: refstorage`.
> In the trash directory, the .git/config did specify that extension.
> Perhaps some interference coming from
> t9500-gitweb-standalone-no-errors.sh, since it invokes:
>
>> git config extensions.refstorage "$refstorage"

Puzzled.  We run t9211 in "t/trash directory.t9211-whatever/"
directory with its own repository, so that what t9500 does in its
own playpen, "t/trash directory.t9500-gitweb-standalone-no-errors/"
directory would not interfere with it to begin with.  How would that
setting seep through to an unrelated test run next door?  It's not
like they share TCP port number or anything like that?




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-26 19:23                   ` Junio C Hamano
@ 2024-05-27  5:12                     ` Patrick Steinhardt
  0 siblings, 0 replies; 31+ messages in thread
From: Patrick Steinhardt @ 2024-05-27  5:12 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Philip, Jeff King, git, Derrick Stolee, Jeff Hostetler

[-- Attachment #1: Type: text/plain, Size: 2191 bytes --]

On Sun, May 26, 2024 at 12:23:18PM -0700, Junio C Hamano wrote:
> Philip <philip.c.peterson@gmail.com> writes:
> 
> > Part of the problem seems to be that the Github actions runner has a bug
> > on OSX: https://github.com/actions/runner/issues/884
> >
> > Based on investigating this for a while by setting up a self-hosted actions
> > runner, it seems to have to do with a broken pipe triggering incomplete
> > output capture / termination detection by either Github Action Runner (
> > see issue thread) or maybe even Dotnet Core's
> > System.Diagnostics.Process functionality.
> 
> Thanks for digging into this.

Indeed, thanks for digging.

In any case, whatever it is, it cannot be exclusively due to a bug with
GitHub given that we see the same issue happening with GitLab CI.

> > As for the actual failing test, t9211, what I got on my machine was a
> > failure during clone: `unknown repository extension found: refstorage`.
> > In the trash directory, the .git/config did specify that extension.
> > Perhaps some interference coming from
> > t9500-gitweb-standalone-no-errors.sh, since it invokes:
> >
> >> git config extensions.refstorage "$refstorage"
> 
> Puzzled.  We run t9211 in "t/trash directory.t9211-whatever/"
> directory with its own repository, so that what t9500 does in its
> own playpen, "t/trash directory.t9500-gitweb-standalone-no-errors/"
> directory would not interfere with it to begin with.  How would that
> setting seep through to an unrelated test run next door?  It's not
> like they share TCP port number or anything like that?

This error looks somewhat weird to me. Why should anything part of Git
not recognize the refstorage extension? It almost feels as if there were
different versions of Git being used.

I'm quite positive by now that the error is somewhere in the fsmonitor.
I was double checking whether there is an issue with reuse of some of
its sockets across test suites. But given that the tests have different
HOMEs and that they have different repository paths, I couldn't really
find anything that might be reused across invocations of scalar(1) or
the git-fsmonitor--daemon(1).

Patrtick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
  2024-05-26  6:34                 ` Philip
  2024-05-26 19:23                   ` Junio C Hamano
@ 2024-05-29  9:27                   ` Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Jeff King @ 2024-05-29  9:27 UTC (permalink / raw
  To: Philip; +Cc: Patrick Steinhardt, git, Derrick Stolee, Jeff Hostetler

On Sun, May 26, 2024 at 02:34:05AM -0400, Philip wrote:

> Part of the problem seems to be that the Github actions runner has a bug
> on OSX: https://github.com/actions/runner/issues/884

I think this may be a red herring. The issue described there is that
processes in the runner environment are spawned with SIGPIPE ignored.
But one of the things we do in common-main.c is make sure SIGPIPE is set
back to the default. So unless the problem is in a non-git program (say,
an issue with the shell, or prove, or something), I don't think this
would affect us.

That issue also goes back to 2020, and anecdotally, I would say that the
hangs we are seeing started in the last 6 months or so.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-05-29  9:27 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-09 16:22 [PATCH 0/3] un-breaking osx-gcc ci job Jeff King
2024-05-09 16:23 ` [PATCH 1/3] ci: drop mention of BREW_INSTALL_PACKAGES variable Jeff King
2024-05-09 16:24 ` [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job Jeff King
2024-05-10  7:00   ` Patrick Steinhardt
2024-05-10 20:16     ` Jeff King
2024-05-10 20:32   ` Kyle Lippincott
2024-05-10 20:48     ` Junio C Hamano
2024-05-10 22:02     ` Jeff King
2024-05-10 22:47       ` Junio C Hamano
2024-05-11 17:21         ` Patrick Steinhardt
2024-05-16  7:19           ` Jeff King
2024-05-16  7:27             ` Jeff King
2024-05-16  9:54             ` Patrick Steinhardt
2024-05-17  8:19               ` Jeff King
2024-05-17  8:33                 ` Patrick Steinhardt
2024-05-17 16:59                 ` Junio C Hamano
2024-05-23  9:10                   ` Jeff King
2024-05-23 15:35                     ` Junio C Hamano
2024-05-09 16:25 ` [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc Jeff King
2024-05-10  7:00   ` Patrick Steinhardt
2024-05-10 20:13     ` Jeff King
2024-05-11  7:17       ` Patrick Steinhardt
2024-05-16 12:36         ` Patrick Steinhardt
2024-05-17  8:11           ` Jeff King
2024-05-17  8:25             ` Patrick Steinhardt
2024-05-17 11:30               ` Patrick Steinhardt
2024-05-26  6:34                 ` Philip
2024-05-26 19:23                   ` Junio C Hamano
2024-05-27  5:12                     ` Patrick Steinhardt
2024-05-29  9:27                   ` Jeff King
2024-05-09 16:52 ` [PATCH 0/3] un-breaking osx-gcc ci job Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).